The Human Genome

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful with this new technology (6). From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).

A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics. Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26-28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ~3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size, nonchimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act, 42 U.S.C. §241(d).

Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32).

Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ~130 ml of whole, heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females, comprising one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from each end of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

Table 1. Celera-generated data input into assembly.

                                           Number of reads for different insert libraries
                              Individual   2 kbp        10 kbp       50 kbp       Total         Total number of base pairs
No. of sequencing reads       A            0            0            2,767,357    2,767,357     1,502,674,851
                              B            11,736,757   7,467,755    66,930       19,271,442    10,464,393,006
                              C            [...]        [...]        0            1,735,109     942,164,187
                              D            952,523      1,046,815    0            1,999,338     1,085,640,534
                              F            0            1,498,607    0            1,498,607     813,743,601
                              Total        [...]        [...]        2,834,287    27,271,853    14,808,616,179
Fold sequence coverage        A            0            0            0.52         0.52
(2.9-Gb genome)               B            2.20         1.40         0.01         3.61
                              C            0.16         0.17         0            0.32
                              D            0.18         0.20         0            0.37
                              F            0            0.28         0            0.28
                              Total        2.54         2.04         0.53         5.11
Fold clone coverage           A            0            0            18.39        18.39
                              B            2.96         11.26        0.44         14.67
                              C            0.22         1.33         0            1.54
                              D            0.24         1.58         0            1.82
                              F            0            2.26         0            2.26
                              Total        3.42         16.43        18.84        38.68
Insert size* (mean)           Average      1,951 bp     10,800 bp    50,715 bp
Insert size* (SD)             Average      6.10%        8.10%        14.90%
% Mates†                      Average      74.50        80.80        75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36).

The process for DNA sequencing was modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day's interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation of all software and instrumentation before implementation, and production-scale testing of any process changes.

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file (37). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards, and the cost advantages associated with automation, an economy of scale, and process consistency.
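As a rough consistency check on Table 1, the fold sequence coverage per donor can be recomputed from the read totals, using the average trimmed read length of 543 bp and the 2.9-Gbp genome size stated in the text. The sketch below is illustrative only; the names READS_PER_DONOR and fold_coverage are hypothetical and are not part of the sequencing pipeline described here.

```python
# Read totals per donor, taken from the "Total" column of Table 1.
READS_PER_DONOR = {
    "A": 2_767_357,
    "B": 19_271_442,
    "C": 1_735_109,
    "D": 1_999_338,
    "F": 1_498_607,
}
MEAN_TRIMMED_READ_BP = 543   # average trimmed read length (section 1.2)
GENOME_SIZE_BP = 2.9e9       # genome size assumed in Table 1


def fold_coverage(n_reads, read_len=MEAN_TRIMMED_READ_BP, genome=GENOME_SIZE_BP):
    """Fold sequence coverage = total sequenced bases / genome size."""
    return n_reads * read_len / genome


# Coverage contributed by each donor, rounded as in the table.
per_donor = {d: round(fold_coverage(n), 2) for d, n in READS_PER_DONOR.items()}

# Coverage of the whole data set.
total_reads = sum(READS_PER_DONOR.values())
total_coverage = round(fold_coverage(total_reads), 2)
```

Under these assumptions the per-donor values and the overall figure agree with the "Fold sequence coverage" rows of Table 1 after rounding to two decimals.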

2 Genome Assembly Strategy and Characterization

Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger.

[Figure 2: flow diagram. Products shown, with responsible groups in brackets: human samples [Medical Affairs]; tissue samples, DNA/RNA, and libraries [DNA Resources]; fluorescently labeled DNA [Pre-Sequencing Lab]; trace files [Sequencing Lab]; external and trimmed fragments and proto I/O files [Content Systems]; assemblies. Quality control steps shown include sample screening, size and concentration, insert size and library complexity, titer and functional test, statistical summary monitoring, trace file validation, and vector and contaminant screening.]

Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.

Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ~25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed "chaff," and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.
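The way a mate pair constrains the gap between two contigs can be sketched as follows. This is a minimal illustration, not the scaffolding algorithm used for the assemblies: it assumes both contigs are already oriented left to right, that one read of the pair lies in the left contig and its mate in the right contig, and that the library insert size is known. All function and parameter names are hypothetical.

```python
from statistics import mean


def gap_from_mate_pair(insert_size, left_contig_len, left_read_start,
                       right_read_end):
    """Estimate the gap (bp) between two contigs from one mate pair.

    insert_size     -- expected clone insert length (library mean)
    left_contig_len -- length of the left contig
    left_read_start -- 0-based start of the read placed in the left contig
    right_read_end  -- exclusive end of the mate in the right contig, i.e.
                       how many bases of the right contig lie inside the insert
    A negative result suggests that the two contigs overlap.
    """
    inside_left = left_contig_len - left_read_start
    return insert_size - inside_left - right_read_end


def estimate_gap(insert_size, left_contig_len, mate_pairs):
    """Average the single-pair estimates over all pairs spanning the same gap."""
    return mean(
        gap_from_mate_pair(insert_size, left_contig_len, start, end)
        for start, end in mate_pairs
    )


# Three hypothetical 10-kbp mate pairs linking a 50-kbp contig to its neighbour.
gap = estimate_gap(10_000, 50_000,
                   [(46_000, 3_000), (44_500, 1_700), (47_200, 4_100)])
```

Averaging over several pairs is one simple way to reduce the effect of insert-size variation within a library; the spread of the single-pair estimates gives a feel for how precisely the gap is known.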

2.1 Assembly data sets

We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.1X coverage of the genome, and clone coverage was 3.42X, 16.40X, and 18.84X for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7X clone coverage.
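Clone coverage as defined above can be approximated from the Table 1 figures by counting the clones sequenced from both ends and multiplying by the mean insert size. The sketch below assumes that the number of mated clones is (reads x mate fraction) / 2, which is only a guess at how the table was built, so the result comes out close to, but not identical with, the published 18.84X for the 50-kbp libraries. All names are hypothetical.

```python
GENOME_SIZE_BP = 2.9e9   # genome size assumed in the text


def clone_coverage(n_reads, mate_fraction, mean_insert_bp,
                   genome=GENOME_SIZE_BP):
    """Approximate fold clone coverage for one insert-size class."""
    mated_clones = n_reads * mate_fraction / 2   # two reads per mated clone
    return mated_clones * mean_insert_bp / genome


# 50-kbp libraries: reads from donors A and B, 75.6% mates, 50,715-bp inserts.
reads_50kbp = 2_767_357 + 66_930
cov_50kbp = clone_coverage(reads_50kbp, 0.756, 50_715)
```

The comparison with sequence coverage is the useful part: a few million reads from 50-kbp clones add little sequence coverage but a great deal of physical coverage, which is what makes long-range scaffolding possible.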

The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC are deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1X. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3X to 4X light-shotgun of each BAC clone.

We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Sequences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).
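The tip-excision rule described above can be illustrated with a small helper. This is a sketch under stated assumptions: vector matches are supplied as half-open (start, end) intervals on the contig, produced by an alignment step that is not shown, and the matching vector itself is removed together with the tip, which the text does not state explicitly. All names are hypothetical.

```python
MIN_VECTOR_BP = 25   # shortest vector match that triggers trimming
END_WINDOW_BP = 50   # the match must lie within this distance of a contig end


def trim_vector_tips(contig_len, vector_hits):
    """Return (keep_start, keep_end), the half-open interval kept after trimming.

    vector_hits -- list of half-open (start, end) vector matches on the contig
    """
    keep_start, keep_end = 0, contig_len
    for start, end in vector_hits:
        if end - start < MIN_VECTOR_BP:
            continue                                 # too short to count
        if start <= END_WINDOW_BP:                   # match near the left end
            keep_start = max(keep_start, end)
        elif contig_len - end <= END_WINDOW_BP:      # match near the right end
            keep_end = min(keep_end, start)
        # Internal matches are left untouched by this tip rule.
    return keep_start, keep_end


# A 30-bp vector match starting 10 bp from the left end of a 1000-bp contig.
kept = trim_vector_tips(1000, [(10, 40)])
```

Internal matches are deliberately ignored here because the rule in the text concerns only contig tips; how internal vector was handled is not described in this section.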

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.



2.2 Assembly strategies 

Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow.

For the whole-genome assembly, the PFP data were first disassembled or "shredded" into a synthetic shotgun data set of 550-bp reads that form a perfect 2X covering of the bactigs. This resulted in 16.05 million "faux" reads that were sufficient to cover the genome 2.96X because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (8X), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC, possibly as a result of sample-tracking errors (see below). In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data.

Table 2. GenBank data input into assembly. Columns give the completion phase of the sequence.

Statistic                              Phase 0   Phase 1 and 2         Phase 3

Whitehead Institute/MIT Center for Genome Research, USA
Number of accession records              2,825           6,533             363
Number of contigs                      243,786         138,023             363
Total base pairs                   194,490,158   1,083,848,245      48,829,358
Total vector masked (bp)             1,553,597         875,618           2,202
Total contaminant masked (bp)       13,654,482       4,417,055          98,028
Average contig length (bp)                 798           7,853         134,516

Washington University, USA
Number of accession records        [illegible]           3,232           1,300
Number of contigs                        2,127          61,812           1,300
Total base pairs                     1,195,732     561,171,788     164,214,395
Total vector masked (bp)                21,604         270,942           8,287
Total contaminant masked (bp)           22,469       1,476,141         469,487
Average contig length (bp)                 562           9,079         126,319

Baylor College of Medicine, USA
Number of accession records                  0           1,626             363
Number of contigs                            0          44,861             363
Total base pairs                             0     265,547,066      49,017,104
Total vector masked (bp)                     0         218,769           4,960
Total contaminant masked (bp)                0       1,784,700         485,137
Average contig length (bp)                   0     [illegible]     [illegible]

Production Sequencing Facility, DOE Joint Genome Institute, USA
Number of accession records        [illegible]     [illegible]             754
Number of contigs                        7,052          34,938             754
Total base pairs                     8,680,214     294,249,631      60,975,328
Total vector masked (bp)                22,644         162,651           7,274
Total contaminant masked (bp)          665,818       4,642,372         118,387
Average contig length (bp)               1,231           8,422          80,867

The Institute of Physical and Chemical Research (RIKEN), Japan
Number of accession records                  0           1,149             300
Number of contigs                            0          25,772             300
Total base pairs                             0     182,812,275      20,093,926
Total vector masked (bp)                     0         203,792           2,371
Total contaminant masked (bp)                0         308,426          27,781
Average contig length (bp)                   0           7,093          66,978

Sanger Centre, UK
Number of accession records                  0           4,538           2,599
Number of contigs                            0          74,324           2,599
Total base pairs                             0     689,059,692     246,118,000
Total vector masked (bp)                     0         427,326          25,054
Total contaminant masked (bp)                0       2,066,305         374,561
Average contig length (bp)                   0           9,271          94,697

Others*
Number of accession records                 42           1,894           3,458
Number of contigs                        5,978          29,898           3,458
Total base pairs                     5,564,879     283,358,877     246,474,157
Total vector masked (bp)                57,448         279,477          32,136
Total contaminant masked (bp)          575,366       1,616,665       1,791,849
Average contig length (bp)                 931           9,478          71,277

All centers combined†
Number of accession records              3,021          21,015           9,137
Number of contigs                      258,943         409,628           9,137
Total base pairs                   209,930,983   3,360,047,574     835,722,268
Total vector masked (bp)             1,655,293       2,438,575          82,284
Total contaminant masked (bp)       14,918,135      16,311,664       3,365,230
Average contig length (bp)                 811           8,203          91,466

*Other centers contributing at least 0.1% of the sequence include: Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center; University of Washington.
†The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.96X coverage of the genome.
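
The shredding of bactigs into 550-bp faux reads can be illustrated with a short sketch. It assumes, for the sake of the example, that a "2X covering" is obtained by starting a new read every half read length (275 bp); the function name `shred` and the handling of the final read are hypothetical and are not taken from the text. As a rough consistency check, 16.05 million reads of 550 bp total about 8.8 Gbp, close to twice the 4.4 Gbp of bactig sequence in Table 2.

```python
def shred(bactig: str, read_len: int = 550, stride: int = 275) -> list[str]:
    """Cut one bactig sequence into overlapping fixed-length faux reads.

    With stride = read_len / 2, every interior base is covered by two reads.
    Assumption: a bactig no longer than read_len is emitted as a single read.
    """
    if len(bactig) <= read_len:
        return [bactig]
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, stride))
    # Anchor one final read at the end so the last bases are not dropped.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [bactig[s:s + read_len] for s in starts]
```

In this sketch the first and last 275 bp of each bactig are covered only once unless the end-anchored read overlaps them; a production shredder could treat the ends differently.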

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or "components" that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset, wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera's data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5X Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly 

The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments.

The Overlapper compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data are scrupulously vector-trimmed, the Overlapper can insist on complete overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines operating in parallel.
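
The overlap criterion (at least 40 bp, no more than 6% differences) can be made concrete with a small sketch. It is deliberately simplified and is not the Overlapper itself: it assumes differences are substitutions only (no insertions or deletions), checks only the case where a suffix of one read meets a prefix of another, and uses a brute-force scan that would not scale to tens of millions of reads. The name `end_overlap` is hypothetical.

```python
def end_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    A candidate overlap of n bases is accepted when n >= min_len and the
    number of mismatching positions is at most max_diff * n.
    Returns 0 when no acceptable overlap exists.
    Assumption: substitutions only; indels are not modeled.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_diff * n:
            return n
    return 0
```

For example, two reads sharing a 48-bp end segment with two substituted bases (about 4.2% difference) still qualify, whereas reads with no shared end return 0.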

Every overlap computed above is statistically a 1-in-10^17 event and thus not a coincidental event. What makes assembly combinatorially difficult is that while many overlaps are actually sampled from overlapping regions of the genome, and thus imply that the sequence reads should be assembled together, even more overlaps are actually from two distinct copies of a low-copy repeated element not screened above, thus constituting an error if put together. We call the former "true overlaps" and the latter "repeat-induced overlaps." The assembler must avoid choosing repeat-induced overlaps, especially early in the process.

We achieve this objective in the Unitigger. We first find all assemblies of reads that appear to be uncontested with respect to all other reads. We call the contigs formed from these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very likely to be correctly assembled, of which we select those that will consistently scaffold (see below), and thus are again almost certain to be correct. We call the union of these two sets U-unitigs. Empirically, we found from a 6X simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are >2 kbp long. We are further able to identify the boundary of the start of a repetitive element at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all singly interspersed Alu elements and other 100- to 400-bp repetitive segments.
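
One way to picture such a log-odds discriminator is the sketch below. It is an illustration under an assumed model, not the discriminator described in the text: reads are assumed to start at positions following a Poisson process, so a single-copy unitig of length L is expected to collect L x (total reads / genome length) reads, and a two-copy collapse twice that. The ratio of the two Poisson likelihoods for k observed reads then simplifies to (expected reads) - k ln 2. The function name and parameters are hypothetical.

```python
import math

def unique_log_odds(unitig_len_bp: int, n_reads: int,
                    total_reads: float, genome_len_bp: float) -> float:
    """Natural-log odds that a unitig is single-copy rather than a two-copy collapse.

    Assumed model: the read count is Poisson with mean m for unique DNA and
    2m for two collapsed copies, where m = unitig_len_bp * total_reads / genome_len_bp.
    ln[ Pois(k; m) / Pois(k; 2m) ] = m - k * ln(2).
    """
    expected = unitig_len_bp * total_reads / genome_len_bp
    return expected - n_reads * math.log(2)
```

Positive values favor unique DNA and negative values favor a repeat; stringent and relaxed thresholds such as those mentioned above would be cutoffs on this score, with the specific values left unspecified here.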

The result of running the Unitigger was thus a set of correctly assembled subcontigs covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair information to link these together into scaffolds. When there are two or more mate pairs that imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 10^10, assuming that mate pairs are false less than 2% of the time. Thus, one can with high confidence link together all U-unitigs that are linked by at least two 2- or 10-kbp mate pairs, producing intermediate-sized scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This process yielded scaffolds that are on the order of megabase pairs in size with gaps between their contigs that generally correspond to repetitive elements and occasionally to small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a genome.
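
The rule of joining U-unitigs only when at least two mate pairs agree can be sketched with a union-find structure. This is a simplified illustration: it counts supporting mate pairs per unitig pair and ignores the distance and orientation checks that the Scaffolder applies, and the function name `link_unitigs` and its input format are hypothetical.

```python
from collections import Counter

def link_unitigs(mate_links, min_support=2):
    """Group unitigs into scaffolds using mate pairs as links.

    mate_links  -- iterable of (unitig_a, unitig_b), one entry per mate pair
                   whose two reads fall in different unitigs
    min_support -- number of mate pairs required before two unitigs are joined
    Returns a list of sets, each set being the unitigs of one scaffold.
    Assumption: distance and orientation consistency are not checked here.
    """
    support = Counter(tuple(sorted(pair)) for pair in mate_links)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in support.items():
        root_a, root_b = find(a), find(b)
        if count >= min_support and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for unitig in parent:
        groups.setdefault(find(unitig), set()).add(unitig)
    return sorted(groups.values(), key=sorted)
```

A unitig pair supported by a single mate pair stays unjoined, which mirrors the reasoning that one link alone may be a false mate pair.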

For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each stage was progressively more aggressive and thus more likely to make a mistake. For the human assembly, we continued to use the first "Rocks" substage where all unitigs with a good, but not definitive, discriminator score are placed in a scaffold gap. This was done with the condition that two or more mate pairs with one of their reads already in the scaffold unambiguously place the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect gap with this strategy to be less than 10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage of the human assembly, making it more like the mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the gap by virtue of its mated pair M being in a contig of the scaffold and implying R's placement is collected. Celera's mate-pairing information is correct more than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in the gap, and when a read does not belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set of reads within the gap, eliminating any reads that conflict with the assembly. This operation proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data set of human chromosome 22, all stones were placed correctly.

The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap "walking." We did not include the very aggressive "Pebbles" substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size.

Fig. 4. Assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used.

At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with quality-value-weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used.

A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence-constructing subroutines. In addition, memory was a real issue: a straightforward application of the software we had built for Drosophila would have required a computer with 600 gigabytes of RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum of instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq's ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of memory (Compaq's GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours.
The assembly of Celera's data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila. More than 84% of the genome was covered by scaffolds >100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, >62% of all gaps were less than 1 kbp long, and no gap was >100 kbp long. Similarly, more than 65% of the sequence is in contigs >30 kbp, more than 31% is in contigs >100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly.

2.4 Compartmentalized shotgun assembly

In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data.

The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to properly place a Celera read, so all reads were first masked against a library of common repetitive elements, and only matches of at least 40 bp to unmasked portions of the read constituted a hit. Of Celera's 27.27 million reads, 20.76 million matched a bactig and another 0.62 million reads, which did not have any matches, were nonetheless identified as belonging in the region of the bactig's BAC because their mate matched the bactig. Of the remaining reads, 2.92 million were completely screened out and so could not be matched, but the other 2.97 million reads had unmasked sequence totaling 1.189 Gbp that were not found in the GenBank data set. Because the Celera data are 5.11X redundant, we estimate that 240 Mbp of unique Celera sequence is not in the GenBank data set.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

Scaffold size                                   All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds                2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
  (including intrascaffold gaps)
No. of bp in contigs                  2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                             53,591          2,845          1,935          1,060            721
No. of contigs                              170,033        112,207        107,199         93,138         82,009
No. of gaps                                 116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                           72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                   54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                     15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)           2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                       1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                              100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds                2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
  (including intrascaffold gaps)
No. of bp in contigs                  2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                            118,968          2,507          1,637            818            554
No. of contigs                              221,036         99,189         95,494         84,641         76,285
No. of gaps                                 102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                           62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                   23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                     11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)           2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                       1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                              100             90             89             83             77

In the next step of the CSA process, a combining assembler took the relevant 5X Celera reads and bactigs for a BAC entry, and produced an assembly of the combined data for that locale. These high-quality sequence reconstructions were a transient result whose utility was simply to provide more reliable information for the purposes of their tiling into sets of overlapping and adjacent scaffold sequences in the next step. In outline, the combining assembler first examines the set of matching Celera reads to determine if there are excessive pileups indicative of unscreened repetitive elements. Wherever these occur, reads in the repeat region whose mates have not been mapped to consistent positions are removed. Then all sets of mate pairs that consistently imply the same relative position of two bactigs are bundled into a link and weighted according to the number of mates in the bundle. A "greedy" strategy then attempts to order the bactigs by selecting bundles of mate pairs in order of their weight. A selected mate-pair bundle can tie together two formative scaffolds. It is incorporated to form a single scaffold only if it is consistent with the majority of links between contigs of the scaffold. Once scaffolding is complete, gaps are filled by the "Stones" strategy described above for the WGA assembler.

The GenBank data for the Phase 1 and 2 BACs consisted of an average of 19.8 bactigs per BAC of average size 8099 bp. Application of the combining assembler resulted in individual Celera BAC assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold) consisting of an average of 8.57 contigs of average size 18,973 bp. In addition to defining order and orientation of the sequence fragments, there were 57% fewer gaps in the combined result. For Phase 0 data, the average GenBank entry consisted of 91.52 reads of average length 784 bp. Application of the combining assembler resulted in an average of 54.8 scaffolds consisting of an average of 58.1 contigs of average size 873 bp. Basically, some small amount of assembly took place, but not enough Celera data were matched to truly assemble the 0.5X to 1X data set represented by the typical Phase 0 BACs. The combining assembler was also applied to the Phase 3 BACs for SNP identification, confirmation of assembly, and localization of the Celera reads. The Phase 0 data suggest that a combined whole-genome shotgun data set and 1X light-shotgun of BACs will not yield good assembly of BAC regions; at least a 3X light-shotgun of each BAC is needed.



The 5.89 million Celera fragments not matching the GenBank data were assembled with our whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence. All scaffolds >5 kbp were forwarded along with all scaffolds produced by the combining assembler to the subsequent tiling phase.

At this stage, we typically had one or two scaffolds for every BAC region constituting at least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds. The next step in developing the genome components was to determine the order and overlap tiling of these BAC- and Celera-unique scaffolds across the genome. For this, we used Celera's 50-kbp mate-pairs information, and BAC-end pairs (18) and sequence tagged site (STS) markers (44) to provide long-range guidance and chromosome separation. Given the relatively manageable number of scaffolds, we chose not to produce this tiling in a fully automated manner, but to compute an initial tiling with a good heuristic and then use human curators to resolve discrepancies or missed join opportunities. To this end, we developed a graphical user interface that displayed the graph of tiling overlaps and the evidence for each. A human curator could then explore the implication of mapped STS data, dot-plots of sequence overlap, and a visual display of the mate-pair evidence supporting a given choice. The result of this process was a collection of "components," where each component was a tiled set of BAC and Celera-unique scaffolds that had been curator-approved. The process resulted in 3845 components with an estimated span of 2.922 Gbp.

In order to generate the final CSA, we assembled each component with the WGA algorithm. As was done in the WGA process, the bactig data were shredded into a synthetic 2X shotgun data set in order to give the assembler the freedom to independently assemble the data. By using faux reads rather than bactigs, the assembly algorithm could correct errors in the assembly of bactigs and remove chimeric content in a PFP data entry. Chimeric or contaminating sequence (from another part of the genome) would not be incorporated into the reassembly of the component because it did not belong there. In effect, the previous steps in the CSA process served only to bring together Celera fragments and PFP data relevant to a large contiguous segment of the genome, wherein we applied the assembler used for WGA to produce an ab initio assembly of the region.

WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span and consisting of 2.654 Gbp of sequence. The chaff, or set of reads not incorporated into the assembly, numbered 6.17 million, or 22%. More than 90.0% of the genome was covered by scaffolds spanning >100 kbp, and these averaged 92.2% sequence and 7.8% gaps with a total of 2.492 Gbp of sequence. There were a total of 105,264 gaps among the 107,199 contigs that belong to the 1940 scaffolds spanning >100 kbp. The average scaffold size was 1.4 Mbp, the average contig size was 23.24 kbp, and the average gap size was 2.0 kbp, where each distribution of sizes was exponential. As such, averages tend to be underrepresentative of the majority of the data. Figure 5 shows a histogram of the bases in scaffolds of various size ranges. Consider also that more than 49% of all gaps were <500 bp long, more than 62% of all gaps were <1 kbp, and all gaps are <100 kbp long. Similarly, more than 73% of the sequence is in contigs >30 kbp, more than 49% is in contigs >100 kbp, and the largest contig was 1.99 Mbp long. Table 3 provides summary statistics for the structure of this assembly with a direct comparison to the WGA assembly.

2.5 Comparison of the WGA and CSA scaffolds

Having obtained two assemblies of the human genome via independent computational processes (WGA and CSA), we compared scaffolds from the two assemblies as another means of investigating their completeness, consistency, and contiguity. From each assembly, a set of reference scaffolds containing at least 1000 fragments (Celera sequencing reads or bactig shreds) was obtained; this amounted to 2218 WGA scaffolds and 1717 CSA scaffolds, for a total of 2.087 Gbp and 2.474 Gbp. The sequence of each reference scaffold was compared to the sequence of all scaffolds from the other assembly with which it shared at least 20 fragments or at least 20% of the fragments of the smaller scaffold. For each such comparison, all matches of at least 200 bp with at most 2% mismatch were tabulated.

From this tabulation, we estimated the amount of unique sequence in each assembly in two ways. The first was to determine the number of bases of each assembly that were not covered by a matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%) was not covered by the CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the WGA. This estimate did not require any consistency of the assemblies or any uniqueness of the matching segments. Thus, another analysis was conducted in which matches of less than 1 kbp between a pair of scaffolds were excluded unless they were confirmed by other matches having a consistent order and orientation. This gives some measure of consistent coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and 2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more stringent measure.
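
The first estimate, the number of bases not covered by any matching segment, amounts to taking the union of match intervals on a scaffold and subtracting its size from the scaffold length. A minimal sketch follows; the function name `uncovered_bases` and the half-open interval convention are assumptions for illustration, and the order-and-orientation filtering of the second, more stringent analysis is not modeled.

```python
def uncovered_bases(length, matches):
    """Count bases of a scaffold not covered by any matching segment.

    length  -- scaffold length in bp
    matches -- list of (start, end) half-open intervals on the scaffold,
               possibly overlapping and in any order
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(matches):
        # Clamp each match to the scaffold and skip empty intervals.
        start, end = max(0, start), min(length, end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return length - covered
```

Summing this quantity over all reference scaffolds of one assembly and dividing by their total length gives a percentage of the kind quoted above; for instance, 82.5 Mbp out of 2.087 Gbp is about 3.95%.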

The comparison of WGA to CSA also permitted evaluation of scaffolds for structural inconsistencies. We looked for instances in which a large section of a scaffold from one assembly matched only one scaffold from the other assembly, but failed to match over the full length of the overlap implied by the matching segments. An initial set of candidates was identified automatically, and then each candidate was inspected by hand. From this process, we identified 31 instances in which the assemblies appear to disagree in a nonlocal fashion. These cases are being further evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The following results exclude cases in which one contig in one assembly corresponds to more than one overlapping contig in the other assembly (as long as the order and orientation of the latter agrees with the positions they match in the former). Most of these small rearrangements involved segments on the order of hundreds of base pairs and rarely >1 kbp. We found a total of 295 kbp (0.012%) in the CSA assemblies that were locally inconsistent with the WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were inconsistent with the CSA assembly.

The CSA assembly was a few percentage points better in terms of coverage and slightly more consistent than the WGA, because it was in effect performing a few thousand shotgun assemblies of megabase-sized problems, whereas the WGA is performing a shotgun assembly of a gigabase-sized problem. When one considers the increase of two-and-a-half orders of magnitude in problem size, the information loss between the two is remarkably small. Because CSA was logistically easier to deliver and the better of the two results available at the time when downstream analyses needed to be begun, all subsequent analysis was performed on this assembly.

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to order and orient the scaffolds on the chromosomes. We first grouped scaffolds together on the basis of their order in the components from CSA. These grouped scaffolds were reordered by examining residual mate-pairing data between the scaffolds. We next mapped the scaffold groups onto the chromosome using physical mapping data. This step depends on having reliable high-resolution map information such that each scaffold will overlap multiple markers. There are two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC clones developed at Washington University (45). Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most useful for mapping scaffolds. The two different mapping approaches are complementary to one another. The fingerprint maps should have better local order because they were built by comparison of overlapping BAC clones. On the other hand, GM99 should have a more reliable long-range order, because the framework markers were derived from well-validated genetic maps. Both types of maps were used as a reference for human curation of the components that were the input to the regional assembly, but they did not determine the order of sequences produced by the assembler.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total sequence is indicated.



In order to determine the effectiveness of the fingerprint maps and GM99 for mapping scaffolds, we first examined the reliability of these maps by comparison with large scaffolds. Only 1% of the STS markers on the 10 largest scaffolds (those >9 Mbp) were mapped on a different chromosome on GM99. Two percent of the STS markers disagreed in position by more than five framework bins. However, for the fingerprint maps, a 2% chromosome discrepancy was observed, and on average 23.8% of BAC locations in the scaffold sequence disagreed with fingerprint map placement by more than five BACs. When further examining the source of discrepancy, it was found that most of the discrepancy came from 4 of the 10 scaffolds, indicating that there is variation in the quality of either the map or the scaffolds. All four scaffolds were assembled as well as the other six, as judged by clone coverage analysis, and showed the same low discrepancy rate to GM99, and thus we concluded that the fingerprint map global order in these cases was not reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of STSs were discordant by more than five framework bins), but a lower discordance rate with the fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs). This observation agrees with the clone coverage analysis (46) that Celera scaffold construction was better supported by long-range mate pairs in larger scaffolds than in small scaffolds.

We created two orderings of Celera scaffolds on the basis of the markers (BAC or STS) on these maps. Where the order of scaffolds agreed between GM99 and the WashU BAC map, we had a high degree of confidence that that order was correct; these scaffolds were termed "anchor scaffolds." Only scaffolds with a low overall discrepancy rate with both maps were considered anchor scaffolds. Scaffolds in GM99 bins were allowed to permute in their order to match WashU ordering, provided they did not violate their framework orders. Orientation of individual scaffolds was determined by the presence of multiple mapped markers with consistent order. Scaffolds with only one marker have insufficient information to assign orientation. We found 70.1% of the genome in anchored scaffolds, more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution than the WashU map, a number of scaffolds without STS matches could be ordered relative to the anchored scaffolds because they included sequence from the same or adjacent BACs on the WashU map. On the other hand, because of occasional WashU global ordering discrepancies, a number of scaffolds determined to be "unmappable" on the WashU map could be ordered relative to the anchored scaffolds with GM99. These scaffolds were termed "ordered scaffolds." We found that 13.9% of the assembly could be ordered by these additional methods, and thus 84.0% of the genome was ordered unambiguously.
Next, all scaffolds that could be placed, but not ordered, between anchors were assigned to the interval between the anchored scaffolds and were deemed to be "bounded" between them. For example, small scaffolds having STS hits from the same GeneMap bin or hitting the same BAC cannot be ordered relative to each other, but can be assigned a placement boundary relative to other anchored or ordered scaffolds. The remaining scaffolds either had no localization information, conflicting information, or could only be assigned to a generic chromosome location. Using the above approaches, ~98% of the genome was anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on the chromosome by spreading out the scaffolds per chromosome. We assumed that the remaining unmapped scaffolds, constituting 2% of the genome, were distributed evenly across the genome. By dividing the sum of unmapped scaffold lengths by the number of mapped scaffolds, we arrived at an estimate of interscaffold gap of 1483 bp. This gap was used to separate all the scaffolds on each chromosome and to assign an offset in the chromosome.
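The gap estimate and offset assignment described above can be sketched as follows. This is an illustrative reading of the procedure under stated assumptions (scaffolds on a chromosome are already in order, offsets start at zero, and the gap is rounded to a whole base), with hypothetical function names and toy lengths; it is not the pipeline used for the published assembly.

```python
def interscaffold_gap(unmapped_lengths, n_mapped_scaffolds):
    """Average unmapped sequence per mapped scaffold, in whole bases."""
    return round(sum(unmapped_lengths) / n_mapped_scaffolds)


def assign_offsets(scaffold_lengths, gap):
    """Place ordered scaffolds end to end, separated by a fixed gap."""
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Hypothetical toy data: three unmapped scaffolds, four mapped ones.
toy_gap = interscaffold_gap([1000, 2000, 3000], n_mapped_scaffolds=4)
toy_offsets = assign_offsets([10000, 5000, 20000], toy_gap)
print(toy_gap, toy_offsets)
```

With real inputs, the same division applied to the genome-wide totals is what the text reports as yielding the 1483-bp interscaffold gap.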

During the scaffold-mapping effort, we encountered many problems that resulted in additional quality assessment and validation analysis. At least 978 (3% of 33,173) BACs were believed to have sequence data from more than one location in the genome (47). This is consistent with the bactig chimerism analysis reported above in the Assembly Strategies section. These BACs could not be assigned to unique positions within the CSA assembly and thus could not be used for ordering scaffolds. Likewise, it was not always possible to assign STSs to unique locations in the assembly because of genome duplications, repetitive elements, and pseudogenes.

Because of the time required for an exhaustive search for a perfect overlap, CSA generated 21,607 intrascaffold gaps where the mate-pair data suggested that the contigs should overlap, but no overlap was found. These gaps were defined as a fixed 50 bp in length and make up 18.6% of the total 116,442 gaps in the CSA assembly.

We chose not to use the order of exons implied in cDNA or EST data as a way of ordering scaffolds. The rationale for not using these data was that doing so would have biased certain regions of the assembly by rearranging scaffolds to fit the transcript data and made validation of both the assembly and gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome from the perspectives of completeness (amount of coverage of the genome) and correctness (the structural accuracy of the order and orientation and the consensus sequence of the assembly).

Completeness. Completeness is defined as the percentage of the euchromatic sequence represented in the assembly. This cannot be known with absolute certainty until the euchromatin sequence has been completed. However, it is possible to estimate completeness on the basis of (i) the estimated sizes of intrascaffold gaps; (ii) coverage of the two published chromosomes, 21 and 22 (48, 49); and (iii) analysis of the percentage of an independent set of random sequences (STS markers) contained in the assembly. The whole-genome libraries contain heterochromatic sequence and, although no attempt has been made to assemble it, there may be instances of unique sequence embedded in regions of heterochromatin as were observed in Drosophila (50, 51).

The sequences of human chromosomes 21 and 22 have been completed to high quality and published (48, 49). Although this sequence served as input to the assembler, the finished sequence was shredded into a shotgun data set so that the assembler had the opportunity to assemble it differently from the original sequence in the case of structural polymorphisms or assembly errors in the BAC data. In particular, the assembler must be able to resolve repetitive elements at the scale of components (generally multimegabase in size), and so this comparison reveals the level to which the assembler resolves repeats. In certain areas, the assembly structure differs from the published versions of chromosomes 21 and 22 (see below). The flexibility to assemble "finished" sequence differently on the basis of Celera data resulted in an assembly with more segments than the chromosome 21 and 22 sequences. We examined the reasons why there are more gaps in the Celera sequence than in chromosomes 21 and 22 and expect that they may be typical of gaps in other regions of the genome. In the Celera assembly, there are 25 scaffolds, each containing at least 10 kb of sequence, that collectively span 94.3% of chromosome 21. Sixty-two scaffolds span 95.7% of chromosome 22. The total length of the gaps remaining in the Celera assembly for these two chromosomes is 3.4 Mbp. These gap sequences were analyzed by RepeatMasker and by searching against the entire genome assembly (52). About 50% of the gap sequence consisted of common repetitive elements identified by RepeatMasker; more than half of the remainder was lower copy number repeat elements.
A more global way of assessing completeness is to measure the content of an independent set of sequence data in the assembly. We compared 48,938 STS markers from Genemap99 (51) to the scaffolds. Because these markers were not used in the assembly processes, they provided a truly independent measure of completeness. ePCR (55) and BLAST (54) were used to locate STSs on the assembled genome. We found 44,524 (91%) of the STSs in the mapped genome. An additional 2648 markers (5.4%) were found by searching the unassembled data or "chaff." We identified 1283 STS markers (2.6%) not found in either Celera sequence or BAC data as of September 2000, raising the possibility that these markers may not be of human origin. If that were the case, the Celera assembled sequence would represent 93.4% of the human genome and the unassembled data 5.5%, for a total of 98.9% coverage. Similarly, we compared CSA against 36,678 TNG radiation hybrid markers (55a) using the same method. We found that 32,371 markers (88%) were located in the mapped CSA scaffolds, with 2055 markers (5.6%) found in the remainder. This gave a 94% coverage of the genome through another genome-wide survey.
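The STS-based completeness figures can be reproduced with a few lines of arithmetic. The sketch below assumes that the denominator for the "if that were the case" estimate is the number of markers remaining after the 1283 unlocated markers are set aside; that reading is an assumption, chosen because it reproduces the quoted 93.4%. The function name is hypothetical, and the last digit of the unassembled fraction may differ from the text depending on rounding.

```python
def sts_completeness(total_markers, in_assembly, in_unassembled, not_found):
    """Fractions of presumed-human STS markers found in each data set."""
    presumed_human = total_markers - not_found
    return (
        presumed_human,
        in_assembly / presumed_human,
        in_unassembled / presumed_human,
    )


# Marker counts quoted in the text for the GeneMap99 comparison.
presumed_human, frac_assembled, frac_unassembled = sts_completeness(
    total_markers=48_938,
    in_assembly=44_524,
    in_unassembled=2_648,
    not_found=1_283,
)
print(presumed_human)
print(f"{frac_assembled:.1%} assembled, {frac_unassembled:.2%} unassembled")
```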

Correctness. Correctness is defined as the structural and sequence accuracy of the assembly. Because the source sequences for the Celera data and the GenBank data are from different individuals, we could not directly compare the consensus sequence of the assembly against other finished sequence for determining sequencing accuracy at the nucleotide level, although this has been done for identifying polymorphisms as described in Section 6. The accuracy of the consensus sequence is at least 99.96% on the basis of a statistical estimate derived from the quality values of the underlying reads.

The structural consistency of the assembly can be measured by mate-pair analysis. In a correct assembly, every mated pair of sequencing reads should be located on the consensus sequence with the correct separation and orientation between the pairs. A pair is termed "valid" when the reads are in the correct orientation and the distance between them is within the mean ± 3 standard deviations of the distribution of insert sizes of the library from which the pair was sampled. A pair is termed "misoriented" when the reads are not correctly oriented, and is termed "misseparated" when the distance between the reads is not in the correct range but the reads are correctly oriented. The mean ± the standard deviation of each library used by the assembler was determined as described above. To validate these, we examined all reads mapped to the finished sequence of chromosome 21 (48) and determined how many incorrect mate pairs there were as a result of laboratory tracking errors and chimerism (two different segments of the genome cloned into the same plasmid), and how tight the distribution of insert sizes was for those that were correct (Table 5). The standard deviations for all Celera libraries were quite small, less than 15% of the insert length, with the exception of a few 50-kbp libraries. The 2- and 10-kbp libraries contained less than 2% invalid mate pairs, whereas the 50-kbp libraries were somewhat higher (~10%). Thus, although the mate-pair information was not perfect, its accuracy was such that measuring valid, misoriented, and misseparated pairs with respect to a given assembly was deemed to be a reliable instrument for validation purposes [...] several mate pairs confirm or deny an ordering.

The clone coverage of the genome was 39X, meaning that any given base pair was, on average, contained in 39 clones or, equivalently, spanned by 39 mate-paired reads. Areas of low clone coverage or areas with a high proportion of invalid mate pairs would indicate potential assembly problems. We computed the coverage of each base in the assembly by valid mate pairs (Table 6). In summary, for scaffolds >30 kbp in length, less than 1% of the Celera assembly was in regions of less than 3X clone coverage. Thus, more than 99% of the assembly, including order and orientation, is strongly supported by this measure alone.

We examined the locations and number of all misoriented and misseparated [...] the CSA [...] performed a study of the PFP assembly as of 5 September 2000 (30, 55b). In this latter case, Celera mate pairs had to be mapped to the PFP assembly. To avoid mapping errors due to high-fidelity repeats, the only pairs mapped were those for which both reads matched at only one location with less than 6% differences. A threshold was set such that sets of five or more simultaneously invalid mate pairs indicated a potential breakpoint, where the construction of the two assemblies differed. The graphic comparison of the CSA chromosome 21 assembly with the published sequence (Fig. 6A) serves as a validation of this methodology. Blue tick marks in the panels indicate breakpoints. There were a similar (small) number of breakpoints on both chromosome sequences. The exception was 12 sets of scaffolds in the Celera assembly (a total of 3% of the chromosome length in 212 single-contig scaffolds) that were mapped to the wrong positions because they were too small to be mapped reliably. Figures 6 and 7 and Table 6 illustrate the mate-pair differences and breakpoints between the two assemblies. There was a higher percentage of misoriented and misseparated mate pairs in the large-insert libraries (50 kbp and BAC ends) than in the small-insert libraries in both assemblies (Table 6). The large-insert libraries are more likely to identify discrepancies [...] the genome. The graphic comparison between [...] (Fig. 6, B and C) shows that there are many

Table 4. Summary of scaffold mapping. Scaffolds were mapped to the genome with different levels of confidence (anchored scaffolds have the highest confidence; unmapped scaffolds have the lowest). Anchored scaffolds were consistently ordered by the WashU BAC map and GM99. Ordered scaffolds were consistently ordered by at least one of the following: the WashU BAC map, GM99, or component tiling path. Bounded scaffolds had order conflicts between at least two of the external maps, but their placements were adjacent to a neighboring anchored or ordered scaffold. Unmapped scaffolds had, at most, a chromosome assignment. The scaffold subcategories are given below each category.

Mapped scaffold category    Number    Length (bp)      % of total length
Anchored                     1,526    1,860,676,676    70
  Oriented                   1,246    1,852,088,645    70
  Unoriented                   280        8,588,031    0.3
Ordered                      2,001      369,235,857    14
  Oriented                     839      329,633,166    12
  Unoriented                 1,162       39,602,691    2
Bounded                     38,241      368,753,463    14
  Oriented                   7,453      274,536,424    10
  Unoriented                30,788       94,217,039    4
Unmapped                    11,823       55,313,737    2
  Known chromosome             281        2,505,844    0.1
  Unknown chromosome        11,542       52,807,893    2

Table 5. Mate-pair validation. Celera fragment sequences were mapped to the published sequence of chromosome 21. Each mate pair uniquely mapped was evaluated for correct orientation and placement (number of mate pairs tested). If the two mates had incorrect relative orientation or placement, they were considered invalid (number of invalid mate pairs).
[...] gene boundaries. During this process, multiple hits to the same region were collapsed to a coherent set of data by tracking the coverage of a region. For example, if a group of bases was represented by multiple overlapping ESTs, the union of these regions matched by the set of ESTs on the scaffold was marked as being supported by EST evidence. This resulted in a series of "gene bins," each of which was believed to contain a single gene. One weakness of this initial implementation of the algorithm was in predicting gene boundaries in regions of tandemly duplicated genes. Gene clusters frequently resulted in homologous neighboring genes being joined together, resulting in an annotation that artificially concatenated these gene models.

Next, known genes (those with exact matches of a full-length cDNA sequence to the genome) were identified, and the region corresponding to the cDNA was annotated as a predicted transcript. A subset of the curated human gene set RefSeq from the National Center for Biotechnology Information (NCBI) was included as a data set searched in the computational pipeline. If a RefSeq transcript matched the genome assembly for at least 50% of its length at >92% identity, then the Sim4 (63) alignment of the RefSeq transcript to the region of the genome under analysis was promoted to the status of an Otto annotation. Because the genome sequence has gaps and sequence errors such as frameshifts, it was not always possible to predict a transcript that agrees precisely with the experimentally determined cDNA sequence. A total of 6538 genes in our inventory were identified and transcripts predicted in this way.

Regions that have a substantial amount of sequence similarity, but do not match known genes, were analyzed by that part of the Otto system that uses the sequence similarity information to predict a transcript. Here Otto





Fig. 6. Comparison of the CSA and the PFP assembly. (A) All of chromosome 21, (B) all of chromosome 8, and (C) a 1-Mb region of chromosome 8 representing a single Celera scaffold. To generate the figure, Celera fragment sequences were mapped onto each assembly. The PFP assembly is indicated in the upper third of each panel; the Celera assembly is indicated in the lower third. In the center of the panel, green lines show Celera sequences that are in the same order and orientation in both assemblies and form the longest consistently ordered run of sequences. Yellow lines indicate sequence blocks that are in the same orientation, but out of order. Red lines indicate sequence blocks that are not in the same orientation. For clarity, in the latter two cases, lines are only drawn between segments of matching sequence that are at least 50 kbp long. The top and bottom thirds of each panel show the extent of Celera mate-pair violations (red, misoriented; yellow, incorrect distance between the mates) for each assembly grouped by library size. (Mate pairs that are within the correct distance, as expected from the mean library insert size, are omitted from the figure for clarity.) Predicted breakpoints, corresponding to stacks of violated mate pairs of the same type, are shown as blue ticks on each assembly axis. Runs of more than 10,000 Ns are shown as cyan bars. Plots of all 24 chromosomes can be seen in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

Fig. 7. Schematic view of the distribution of breakpoints and large gaps on all chromosomes. For each chromosome, the upper pair of lines represent the PFP assembly, and the lower pair of lines represent Celera's assembly. Blue tick marks represent breakpoints, whereas red tick marks represent a gap of larger than 10,000 bp. The number of breakpoints per chromosome is indicated in black, and the chromosome numbers in red.

[hu]man genome. The sequence from the region of genomic DNA contained in a gene bin was extracted, and the subsequences supported by any homology evidence were marked (plus 100 bases flanking these regions). The other bases in the region, those not covered by any homology evidence, were replaced by N's. This sequence segment, with high-confidence regions represented by the consensus genomic sequence and the remainder represented by N's, was then evaluated by Genscan to see if a consistent gene model could be generated. This procedure simplified the gene-prediction task by first establishing the boundary for the gene (not a strength of most gene-finding algorithms), and by eliminating regions with no supporting evidence. If Genscan returned a plausible gene model, it was further evaluated before being promoted to an "Otto" annotation. The final Genscan predictions were often quite different from the prediction that Genscan returned on the same region of native genomic sequence. A weakness of using Genscan to refine the gene model is the loss of valid, small exons from the final annotation.
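The masking step described above (keep bases supported by homology evidence plus 100 flanking bases, replace everything else with N) can be sketched in a few lines. This is an illustration under assumptions: evidence is given as half-open intervals on the gene-bin sequence, and the function name and toy sequence are hypothetical. It does not reproduce the Otto implementation or call Genscan.

```python
def mask_unsupported(sequence, evidence, flank=100):
    """Replace bases outside evidence intervals (plus flanks) with 'N'."""
    keep = [False] * len(sequence)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy example with a 2-base flank so the effect is visible on 20 bases.
toy_sequence = "ACGT" * 5
toy_masked = mask_unsupported(toy_sequence, evidence=[(8, 12)], flank=2)
print(toy_masked)
```

The masked string would then be the input handed to the gene finder, so that only evidence-supported regions can contribute exons to the model.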

Table 7. Sensitivity and specificity of Otto and Genscan. Sensitivity and specificity were calculated by first aligning the prediction to the published RefSeq transcript, tallying the number (N) of uniquely aligned RefSeq bases. Sensitivity is the ratio of N to the length of the published RefSeq transcript. Specificity is the ratio of N to the length of the prediction. All differences are significant (Tukey HSD; P < 0.001).

Method                  Sensitivity    Specificity
Otto (RefSeq only)*     0.939          0.973
Otto (homology)†        0.604          0.884
Genscan                 0.501          0.633

*Refers to those annotations produced by Otto using only the Sim4-polished RefSeq alignment rather than an evidence-based Genscan prediction. †Refers to those annotations produced by supplying all available evidence to Genscan.

The next step in defining gene structures based on sequence similarity was to compare each predicted transcript with the homology-based evidence that was used in previous steps to evaluate the depth of evidence for each exon in the prediction. Internal exons were considered to be supported if they were covered by homology evidence to within ±10 bases of their edges. For first and last exons, the internal edge was required to be within 10 bases, but the external edge was allowed greater latitude to allow for 5' and 3' untranslated regions (UTRs). To be retained, a prediction for a multi-exon gene must have evidence such that the total number of "hits," as defined above, divided by the number of exons in the prediction must be >0.66 or must correspond to a RefSeq sequence. A single-exon gene must be covered by at least three supporting hits (±10 bases on each side), and these must cover the complete predicted open reading frame. For a single-exon gene, we also required that the Genscan prediction include both a start and a stop codon. Gene models that did not meet these criteria were disregarded, and those that passed were promoted to Otto predictions. Homology-based Otto predictions do not contain 3' and 5' untranslated sequence. Although three de novo gene-finding programs [GRAIL, Genscan, and FgenesH (63)] were run as part of the computational analysis, the results of these programs were not directly used in making the Otto predictions. Otto predicted 11,226 additional genes by means of sequence similarity.
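The retention criteria for evidence-based predictions can be written as a small rule set. The sketch below is a simplified reading of the text: it checks the ±10-base edge tolerance for an internal exon and applies the hit-to-exon ratio (>0.66) or RefSeq correspondence for multi-exon genes, and the three-hit plus start/stop requirement for single-exon genes. The requirement that hits cover the complete open reading frame is not modeled, and all names are hypothetical.

```python
def exon_supported(exon, hit, tolerance=10):
    """True if both edges of an internal exon lie within tolerance of a hit."""
    return (
        abs(exon[0] - hit[0]) <= tolerance
        and abs(exon[1] - hit[1]) <= tolerance
    )


def retain_prediction(n_exons, n_hits, matches_refseq=False,
                      has_start_and_stop=True):
    """Apply the retention rule to one predicted gene model."""
    if n_exons == 1:
        # Single-exon genes: at least three hits and a complete start/stop.
        return n_hits >= 3 and has_start_and_stop
    # Multi-exon genes: enough supported exons, or a RefSeq correspondence.
    return matches_refseq or (n_hits / n_exons) > 0.66


print(exon_supported((100, 200), (95, 208)))   # edges within 10 bases
print(retain_prediction(n_exons=3, n_hits=2))  # ratio 0.667 exceeds 0.66
```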

3.2 Otto validation

To validate the Otto homology-based process and the method that Otto uses to define the structures of known genes, we compared transcripts predicted by Otto with their corresponding (and presumably correct) transcript from a set of 4512 RefSeq transcripts for which there was a unique Sim4 alignment (Table 7). In order to evaluate the relative performance of Otto and Genscan, we made three comparisons. The first involved a determination of the accuracy of gene models predicted by Otto with only homology data other than the corresponding RefSeq sequence (Otto homology in Table 7). We measured the sensitivity (correctly predicted bases divided by the total length of the cDNA) and specificity (correctly predicted bases divided by the sum of the correctly and incorrectly predicted bases). Second, we examined the sensitivity and specificity of the Otto predictions that were made solely with the RefSeq sequence, which is the process that Otto uses to annotate known genes (Otto-RefSeq). And third, we determined the accuracy of the Genscan predictions corresponding to these RefSeq sequences. As expected, the alignment method (Otto-RefSeq) was the most accurate, and Otto-homology performed better than Genscan by both criteria. Thus, 6.1% of true RefSeq nucleotides were not represented in the Otto-RefSeq annotations and 2.7% of the nucleotides in the Otto-RefSeq transcripts were not contained in the original RefSeq transcripts. The discrepancies could come from legitimate differences between the Celera assembly and the RefSeq transcript due to polymorphisms, incomplete or incorrect data in the Celera assembly, errors introduced by Sim4 during the alignment process, or the presence of alternatively spliced forms in the data set used for the comparisons.
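The base-level sensitivity and specificity defined here reduce to two ratios over the same count of correctly predicted (uniquely aligned) bases. The following minimal sketch uses hypothetical names and toy lengths; the final two lines only restate the Otto-RefSeq values from Table 7 as the complementary percentages quoted in the text.

```python
def sensitivity_specificity(n_aligned, refseq_length, prediction_length):
    """Base-level sensitivity and specificity of one predicted transcript."""
    sensitivity = n_aligned / refseq_length
    specificity = n_aligned / prediction_length
    return sensitivity, specificity


# Hypothetical transcript: 900 of 1,000 RefSeq bases recovered by a
# 950-base prediction.
toy_sens, toy_spec = sensitivity_specificity(900, 1000, 950)
print(toy_sens, round(toy_spec, 3))

# Complement of the Table 7 Otto-RefSeq values, as percentages.
missed_pct = round((1 - 0.939) * 100, 1)  # RefSeq bases not represented
extra_pct = round((1 - 0.973) * 100, 1)   # predicted bases not in RefSeq
print(missed_pct, extra_pct)
```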

Because Otto uses an evidence-based approach to reconstruct genes, the absence of experimental evidence for intervening exons may inadvertently result in a set of exons that cannot be spliced together to give rise to a transcript. In such cases, Otto may "split genes" when in fact all the evidence should be combined into a single transcript. We also examined the tendency of these methods to incorrectly split gene predictions. These trends are shown in Fig. 8. Both RefSeq and homology-based predictions by Otto split known genes into fewer segments than Genscan alone.



.. 3.3 Gene number . 

Recognizing that the Otto system is quite conservative, we used a
different gene-prediction strategy in regions where the homology
evidence was less strong. Here the results of de novo gene
predictions were used. For these genes, we insisted that a predicted
transcript have at least two of the following types of evidence to be
included in the gene set for further analysis: protein, human EST,
rodent EST, or mouse genome fragment matches. This final class of
predicted genes is a subset of the predictions made by the three
gene-finding programs that were used in the computational pipeline.
For these, there was not sufficient sequence similarity information
for Otto to attempt to predict a gene structure. The three de novo
gene-finding programs resulted in about 155,695 predictions, of
which ~76,410 were nonredundant (nonoverlapping with one another).
Of these, 57,935 did not overlap known genes or predictions made by
Otto. Only 21,350 of the gene predictions that did not overlap Otto
predictions were partially supported by at least one type of
sequence similarity evidence, and 8619 were partially supported by
two types of evidence (Table 8).

The sum of this number (21,350) and the number of Otto annotations
(17,764), 39,114, is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement for other
supporting evidence is made more stringent, this number drops
rapidly so that demanding two types of evidence reduces the total
gene number to 26,383 and demanding three types reduces it to
~23,000. Requiring that a prediction be supported by all four
categories of evidence is too stringent because it would eliminate
genes that encode novel proteins (members of currently undescribed
protein families). No correction for pseudogenes has been made at
this point in the analysis.
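The totals quoted above are simple sums. As a minimal sketch of that arithmetic, using only counts given in the text and in Table 8 (the variable and function names are illustrative and are not part of the annotation pipeline):

```python
# Counts quoted in the text and Table 8; all names here are illustrative.
OTTO_ANNOTATIONS = 17_764  # 6538 known-gene matches + 11,226 homology-based

# De novo predictions that do not overlap Otto, keyed by the minimum
# number of supporting evidence types required (Table 8).
DE_NOVO_BY_MIN_EVIDENCE = {1: 21_350, 2: 8_619, 3: 4_947, 4: 1_904}


def gene_count(min_evidence_types: int) -> int:
    """Otto annotations plus de novo predictions meeting the threshold."""
    return OTTO_ANNOTATIONS + DE_NOVO_BY_MIN_EVIDENCE[min_evidence_types]


if __name__ == "__main__":
    for k in sorted(DE_NOVO_BY_MIN_EVIDENCE):
        print(f">= {k} evidence type(s): {gene_count(k):,} genes")
```

With one, two, and three required evidence types this gives 39,114, 26,383, and 22,711 genes, the last being the figure rounded to ~23,000 in the text.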

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition
of a requirement for at least one of the
following evidence types (homology to mouse
genomic sequence fragments, rodent ESTs,
or cDNAs) or similarity to a known protein
reduced this number to 1010. Adding this to
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the number of genes and
presents the degree of confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes were assembled for
further analysis. This set includes the 6538 genes predicted by Otto
on the basis of matches to known genes, 11,226 transcripts predicted
by Otto based on homology evidence, and 8619 from the subset of
transcripts from de novo gene-prediction programs that have two
types of supporting evidence. The 26,383 genes are illustrated along
chromosome diagrams in Fig. 1. These are a very preliminary set of
annotations and are subject to all the limitations of an automated
process. Considerable refinement is still necessary to improve the
accuracy of these transcript predictions. All the predictions and
descriptions of genes and the associated evidence that we present
are the product of completely computational processes, not expert
curation. We have attempted to enumerate the genes in the human
genome in such a way that we have different levels of confidence
based on the amount of supporting evidence: known genes, genes with
good protein or EST homology evidence, and de novo gene predictions
confirmed by modest homology evidence.

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our highest
confidence set.

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction programs average about 3.7
exons. The largest number of exons that we have identified in a
transcript is 234 in the titin mRNA. Table 8 compares the amounts of
evidence that support the Otto and other predicted transcripts. For
example, one can see that a typical Otto transcript has 6.99 of its
7.81 exons supported by protein homology evidence. As would be
expected, the Otto transcripts generally have more support than do
transcripts predicted by the de novo methods.



4 Genome Structure

Summary. This section describes several of the noncoding attributes
of the assembled genome sequence and their correlations with the
predicted gene set. These include an analysis of G+C content and
gene density in the context of cytogenetic maps of the genome, an
enumerative analysis of CpG islands, and a brief description of the
genome-wide repetitive elements.



4.1 Cytogenetic maps 
Perhaps the most obvious, and certainly the most visible, element of
the structure of the genome is the banding pattern produced by
Giemsa stain. Chromosomal banding studies have revealed that about
17% to 20% of the human chromosome complement consists of C-bands,
or constitutive heterochromatin (64). Much of this heterochromatin
is highly polymorphic and consists of different families of alpha
satellite DNAs with various higher order repeat structures (65).
Many chromosomes have complex inter- and intrachromosomal
duplications present in pericentromeric regions (66). About 5% of
the sequence reads were identified as alpha satellite sequences;
these were not included in the assembly. Examination of
pericentromeric regions is ongoing.

Fig. 8. Analysis of split genes resulting from different annotation
methods. A set of 4512 alignments of RefSeq transcripts to the
genomic assembly were chosen (see the text), and the numbers of
overlapping Genscan predictions, Otto (RefSeq only) annotations, and
Otto (homology) annotations were tallied. These data show the degree
to which multiple Genscan predictions and/or Otto annotations were
associated with a single RefSeq transcript. [Bar chart. Horizontal
axis: number of predictions per RefSeq transcript. Series: Otto
(homology), Otto (RefSeq only), Genscan.]

Table 8. Otto and de novo transcripts and exons, by type and number
of lines of supporting evidence.

                                      Types of evidence               No. of lines of evidence*
                           Total    Mouse   Rodent  Protein    Human       ≥1       ≥2      ≥3      ≥4
Otto
  Number of transcripts   17,969   17,065   14,881   15,477   16,374  17,968†   17,501  15,877  12,451
  Number of exons        141,218  111,174   89,569  108,431  118,869  140,710  127,955  99,574  59,804
De novo
  Number of transcripts   58,032   14,463    5,094    8,043    9,220   21,350    8,619   4,947   1,904
  Number of exons        319,935   48,594   19,344   26,264   40,104   79,148   31,130  17,508   6,520
No. of exons per transcript
  Otto                      7.84     5.77     6.01     6.99     7.24     7.81     7.19    6.00    4.28
  De novo                   5.53     3.17     3.80     3.27     4.36     3.7      3.56    3.42    3.16

*A line of evidence was considered to support a prediction on the
basis of a partial match.
†This number includes alternative splice forms of the 17,764 genes.

The remaining ~80% of the genome, the euchromatic component, is
divisible into G-, R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their nucleotide composition and
gene density, although we have been unable to determine precise
band boundaries at the molecular level. T-bands are the most G+C-
and gene-rich, and G-bands are G+C-poor (68). Bernardi has also
offered a description of the euchromatin at the molecular level as
long stretches of DNA of differing base composition, termed
isochores (denoted L, H1, H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as G+C-poor (<43%),
whereas the H (heavy) isochores fall into three G+C-rich classes
representing 24, 8, and 5% of the genome. Gene concentration has
been claimed to be very low in the L isochores and 20-fold more
enriched in the H2 and H3 isochores (70). By examining contiguous
50-kbp windows of G+C content across the assembly, we found that
regions of G+C content >48% (H3 isochores) averaged 273.9 kbp in
length, those with G+C content between 43 and 48% (H1+H2 isochores)
averaged 202.8 kbp in length, and the average span of regions with
<43% (L isochores) was 1078.6 kbp. The correlation between G+C
content and gene density was also examined in 50-kbp windows along
the assembled sequence (Table 9 and Figs. 10 and 11). We found that
the density of genes was greater in regions of high G+C than in
regions of low G+C content, as expected. However, the correlation
between G+C content and gene density was not as skewed as
previously predicted (69). A higher proportion of genes were
located in the G+C-poor regions than had been expected.
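The window-based binning described above can be sketched as follows. This is an illustration only: the thresholds (43% and 48%) and the 50-kbp window come from the text, but the function names, the handling of ambiguous bases, and the way consecutive windows of the same class are joined to obtain an average span are assumptions, not the procedure actually used for the assembly.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def isochore_class(gc: float) -> str:
    """Bin a G+C fraction using the thresholds quoted in the text."""
    if gc > 0.48:
        return "H3"
    if gc >= 0.43:
        return "H1+H2"
    return "L"


def classify_windows(seq: str, window: int = 50_000) -> list[str]:
    """Class of each contiguous, non-overlapping window."""
    return [
        isochore_class(gc_fraction(seq[i:i + window]))
        for i in range(0, len(seq), window)
    ]


def mean_run_length(classes: list[str], window: int = 50_000) -> dict[str, float]:
    """Average span (bp) of runs of consecutive windows sharing a class.

    Assumption: a final partial window is counted as a full window.
    """
    runs: dict[str, list[int]] = {}
    for cls, group in groupby(classes):
        runs.setdefault(cls, []).append(sum(1 for _ in group) * window)
    return {cls: sum(spans) / len(spans) for cls, spans in runs.items()}
```

For a toy sequence, `classify_windows("GGCCATATATAT", window=4)` yields one H3 window followed by two L windows, and `mean_run_length` then reports an L span of 8 bp and an H3 span of 4 bp.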

Chromosomes 17, 19, and 22, which have a disproportionate number of
H3-containing bands, had the highest gene density (Table 10).
Conversely, the chromosomes that we found to have the lowest gene
density, X, 4, 18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3 bands, did not have a
particularly low gene density in our analysis. In addition,
chromosome 8, which we found to have a low gene density, does not
appear to be unusual in its H3 banding.

How valid is Ohno's postulate (71) that mammalian genomes consist
of oases of genes in otherwise essentially empty deserts? It appears
that the human genome does indeed contain deserts, or large,
gene-poor regions. If we define a desert as a region >500 kbp
without a gene, then we see that 605 Mbp, or about 20% of the
genome, is in deserts. These are not uniformly distributed over the
various chromosomes. Gene-rich chromosomes 17, 19, and 22 have only
about 12% of their collective 171 Mbp in deserts, whereas gene-poor
chromosomes 4, 13, 18, and X have 27.5% of their 492 Mbp in deserts
(Table 11). The apparent lack of predicted genes in these regions
does not necessarily imply that they are devoid of biological
function.
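The desert definition above (a region >500 kbp without a gene) translates directly into a scan over sorted gene coordinates. The sketch below is a hypothetical illustration, not the code used for the analysis; in particular, counting the stretches between the chromosome ends and the first or last gene as possible deserts is an assumption.

```python
def desert_bases(genes, chrom_length, min_desert=500_000):
    """Total bases in gene-free stretches longer than min_desert.

    genes: iterable of (start, end) half-open intervals on one chromosome.
    chrom_length: length of the chromosome in bp.
    """
    total = 0
    prev_end = 0
    for start, end in sorted(genes):
        gap = start - prev_end          # negative when genes overlap
        if gap > min_desert:
            total += gap
        prev_end = max(prev_end, end)   # tolerate nested or overlapping genes
    tail = chrom_length - prev_end
    if tail > min_desert:
        total += tail
    return total


if __name__ == "__main__":
    toy_genes = [(100_000, 200_000), (800_000, 900_000), (950_000, 1_000_000)]
    in_deserts = desert_bases(toy_genes, chrom_length=2_000_000)
    print(f"{in_deserts:,} bp in deserts ({in_deserts / 2_000_000:.0%})")
```

Summing `desert_bases` over all chromosomes and dividing by the total length would give a genome-wide fraction comparable to the "about 20%" reported above.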

4.2 Linkage map 

Linkage maps provide the basis for genetic analysis and are widely
used in the study of the inheritance of traits and in the
positional cloning of genes. The distance metric, centimorgans
(cM), is based on the recombination rate between homologous
chromosomes during meiosis. In general, the rate of recombination
in females is greater than that in males, and this degree of map
expansion is not uniform across the genome (72). One of the
opportunities enabled by a nearly complete genome sequence is to
produce the ultimate physical map, and to fully analyze its
correspondence with two other maps that have been widely used in
genome and genetic analysis: the linkage map and the cytogenetic
map. This would close the loop between the mapping and sequencing
phases of the genome project.

Table 9. Characteristics of G+C in isochores.

                      Fraction of genome (%)    Fraction of genes (%)
Isochore   G+C (%)    Predicted*   Observed     Predicted*   Observed
H3         >48             5          9.5           37         24.8
H1/H2      43-48          25         21.2           32         26.6
L          <43            67         69.2           31         48.5

*The predictions were based on Bernardi's definitions (70) of the
isochore structure of the human genome.

Fig. 9. Comparison of the number of exons per transcript between
the 17,968 Otto transcripts and 21,350 de novo transcript
predictions with at least one line of evidence that do not overlap
with an Otto prediction. Both sets have the highest number of
transcripts in the two-exon category, but the de novo gene
predictions are skewed much more toward smaller transcripts. In the
Otto set, 19.7% of the transcripts have one or two exons, and 5.7%
have more than 20. In the de novo set, 49.3% of the transcripts
have one or two exons, and 0.2% have more than 20.

We mapped the location of the markers that constitute the Genethon
linkage map to the genome. The rate of recombination, expressed as
cM per Mbp, was calculated for 3-Mbp windows as shown in Table 12.
Higher rates of recombination in the telomeric region of the
chromosomes have been previously documented (73). From this mapping
result, there is a difference of 4.99 between lowest rates and
highest rates and the largest difference of 4.4 between males and
females (4.99 to 0.47 on chromosome 16). This indicates that the
variability in recombination rates among regions of the genome
exceeds the differences in recombination rates between males and
females. The human genome has recombination hotspots, where
recombination rates vary fivefold or more over a space of 1 kbp, so
the picture one gets of the magnitude of variability in
recombination rate will depend on the size of the window examined.
Unfortunately, too few meiotic crossovers have occurred in Centre
d'Etude du Polymorphisme Humain (CEPH) and other reference families
to provide a resolution any finer than about 3 Mbp. The next
challenge will be to determine a sequence basis of recombination at
the chromosomal level. An accurate predictor of variation in
recombination rates between any pair of markers would be extremely
useful in designing markers to narrow a region of linkage, such as
in positional cloning projects.
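A windowed recombination rate of the kind reported in Table 12 can be sketched from a list of markers that carry both a physical position (bp) and a genetic position (cM). The text does not describe how marker positions were converted to window rates, so the linear interpolation between flanking markers used here, and all function names, are assumptions made for illustration.

```python
from bisect import bisect_right


def interpolate_cm(pos_bp, markers):
    """Linearly interpolate genetic position (cM) at a physical position.

    markers: list of (position_bp, position_cM), sorted by position_bp,
    with distinct physical positions. Positions outside the marker range
    are clamped to the first or last marker.
    """
    positions = [p for p, _ in markers]
    i = bisect_right(positions, pos_bp)
    if i == 0:
        return markers[0][1]
    if i == len(markers):
        return markers[-1][1]
    (p0, c0), (p1, c1) = markers[i - 1], markers[i]
    return c0 + (c1 - c0) * (pos_bp - p0) / (p1 - p0)


def window_rates(markers, window_bp=3_000_000):
    """Recombination rate (cM per Mbp) for consecutive full windows."""
    first, last = markers[0][0], markers[-1][0]
    rates = []
    start = first
    while start + window_bp <= last:
        end = start + window_bp
        d_cm = interpolate_cm(end, markers) - interpolate_cm(start, markers)
        rates.append(d_cm / (window_bp / 1e6))
        start = end
    return rates


if __name__ == "__main__":
    toy_markers = [(0, 0.0), (3_000_000, 3.0), (6_000_000, 9.0)]
    print(window_rates(toy_markers))
```

For the toy markers above the two 3-Mbp windows have rates of 1.0 and 2.0 cM per Mbp. Shrinking `window_bp` raises the apparent variability, which is the window-size dependence noted in the text.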
4.3 Correlation between CpG islands
and genes

CpG islands are stretches of unmethylated DNA with a higher
frequency of CpG dinucleotides when compared with the entire genome
(74). CpG islands are believed to preferentially occur at the
transcriptional start of genes, and it has been observed that most
housekeeping genes have CpG islands at the 5' end of the transcript
(75, 76). In addition, experimental evidence indicates that CpG
island methylation is correlated with gene inactivation (77) and
has been shown to be important during gene imprinting (78) and
tissue-specific gene expression (79).

Experimental methods have been used that resulted in an estimate of
30,000 to 45,000 CpG islands in the human genome (74, 80) and an
estimate of 499 CpG islands on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer (75) used a computational
method to identify CpG islands and defined them as regions of DNA
of >200 bp that have a G+C content of >50% and a ratio of observed
versus expected frequency of CG dinucleotide ≥0.6.
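For reference, the observed versus expected ratio in this definition is conventionally computed from base counts within the candidate region. The symbols below are introduced here for illustration and do not appear in the text: $L$ is the region length, $N_{\mathrm{C}}$ and $N_{\mathrm{G}}$ are the counts of C and G, and $N_{\mathrm{CG}}$ is the count of CG dinucleotides.

```latex
\[
\frac{\text{observed}}{\text{expected}}
  = \frac{N_{\mathrm{CG}}}{N_{\mathrm{C}}\,N_{\mathrm{G}}/L}
  = \frac{N_{\mathrm{CG}}\,L}{N_{\mathrm{C}}\,N_{\mathrm{G}}},
\qquad
\text{island if } L > 200,\quad
\frac{N_{\mathrm{C}} + N_{\mathrm{G}}}{L} > 0.5,\quad
\frac{N_{\mathrm{CG}}\,L}{N_{\mathrm{C}}\,N_{\mathrm{G}}} \ge 0.6 .
\]
```

The expected count $N_{\mathrm{C}} N_{\mathrm{G}} / L$ is what one would see if C and G were placed independently at their observed frequencies.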

It is difficult to make a direct comparison of experimental
definitions of CpG islands with computational definitions because
computational methods do not consider the methylation state of
cytosine and experimental methods do not directly select regions of
high G+C content. However, we can determine the correlation of CpG
island with gene starts, given a set of annotated genomic
transcripts and the whole genome sequence. We used the available
annotation of chromosome 22 as well as the entire human genome in
our assembly and the computationally annotated genes. A variation
of the CpG island computation was compared with Larsen et al. (76).
The main differences are that we use a sliding window of 200 bp,
consecutive windows are merged only if they overlap, and we
recompute the CpG value upon merging, thus rejecting any potential
island if it scores less than the threshold.
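The three differences listed above (a 200-bp sliding window, merging only overlapping windows, and re-scoring after the merge) can be sketched as below. This is a hypothetical reconstruction for illustration, not the code used for the analysis: the 1-bp step, the function names, and the treatment of regions lacking C or G are assumptions.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc = (c + g) / n if n else 0.0
    ratio = cg * n / (c * g) if c and g else 0.0
    return gc, ratio


def is_island(seq, min_gc=0.5, min_ratio=0.6):
    gc, ratio = cpg_stats(seq)
    return gc > min_gc and ratio >= min_ratio


def find_islands(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return (start, end) half-open intervals of candidate CpG islands."""
    # 1. Collect every window that passes both criteria.
    hits = [
        (i, i + window)
        for i in range(0, len(seq) - window + 1, step)
        if is_island(seq[i:i + window], min_gc, min_ratio)
    ]
    # 2. Merge consecutive windows only if they overlap.
    merged = []
    for start, end in hits:
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # 3. Re-score each merged region and reject any that now fails.
    return [(s, e) for s, e in merged if is_island(seq[s:e], min_gc, min_ratio)]


if __name__ == "__main__":
    toy = "AT" * 150 + "CG" * 150 + "AT" * 150
    print(find_islands(toy))                  # method 1 threshold (0.6)
    print(find_islands(toy, min_ratio=0.8))   # method 2 threshold (0.8)
```

Passing `min_ratio=0.8` corresponds to the stricter threshold called method 2 in the following paragraph.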

To compute various CpG statistics, we used two different thresholds
of CG dinucleotide likelihood ratio. Besides using the original
threshold of 0.6 (method 1), we used a higher threshold of CG
dinucleotide likelihood ratio of 0.8 (method 2), which results in
the number of CpG islands on chromosome 22 close to the number of
annotated genes on this chromosome. The main results are summarized
in Table 13. CpG islands computed with method 1 predicted only 2.6%
of the CSA sequence as CpG, but 40% of the gene starts (start
codons) are contained inside a CpG island. This is comparable to
ratios reported by others (82). The last two rows of the table show
the observed and expected average distance, respectively, of the
closest CpG island from the first exon. The observed average
distances to the closest CpG island are smaller than the
corresponding expected distances, confirming an association between
CpG island and the first exon.

We also looked at the distribution of CpG island nucleotides among
various sequence classes such as intergenic regions, introns,
exons, and first exons. We computed the likelihood score for each
sequence class as the ratio of the observed fraction of CpG island
nucleotides in that sequence class and the expected fraction of CpG
island nucleotides in that sequence class. The results of applying
method 1 on CSA were scores of [...] for intergenic region, 1.2 for
intron, 5.86 for exon, and 13.2 for first exon. The same trend was
also found for chromosome 22 and after the application of higher
threshold (method 2) on both data sets. In sum, genome-wide
analysis has extended earlier analysis and suggests a strong
correlation between CpG islands and first coding exons.
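One way to read the likelihood score defined above is as an enrichment ratio: the share of all CpG-island nucleotides that falls in a sequence class, divided by the share of the sequence that the class occupies. The text does not spell out how the expected fraction was obtained, so treating it as the class's share of total bases is an assumption, as are the function name and the toy numbers below (which are made up and are not results from the paper).

```python
def likelihood_scores(class_sizes, island_bases):
    """Enrichment of CpG-island nucleotides per sequence class.

    class_sizes:  {class_name: total bases in the class}
    island_bases: {class_name: bases of that class lying inside CpG islands}
    Assumes the classes are disjoint and together cover the sequence.
    """
    total_bases = sum(class_sizes.values())
    total_island = sum(island_bases.values())
    scores = {}
    for name, size in class_sizes.items():
        observed = island_bases[name] / total_island  # share of island bases
        expected = size / total_bases                 # share under uniformity
        scores[name] = observed / expected
    return scores


if __name__ == "__main__":
    # Toy counts for illustration only.
    sizes = {"intergenic": 700, "intron": 250, "exon": 50}
    islands = {"intergenic": 14, "intron": 5, "exon": 6}
    for name, score in likelihood_scores(sizes, islands).items():
        print(f"{name}: {score:.2f}")
```

A score above 1 means the class holds more island nucleotides than its size alone would predict, which is the pattern reported for exons and first exons.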




Fig. 10. Relation between G+C content and gene density. One set of
bars shows the percent of the genome (in 50-kbp windows) in each
G+C bin; the other shows the percent of the total number of genes
associated with each G+C bin. The portion of the genome with a G+C
content of between 50 and 55% contains nearly 15% of the genes.
[Bar chart. Horizontal axis: G+C content in 5% bins. Series: % of
genome, % of genes.]



4.4 Genome-wide repetitive elements 

The proportion of the genome covered by various classes of
repetitive DNA is presented in Table 14. We observed about 35% of
the genome in these repeat classes, very similar to values reported
previously (83). Repetitive sequence may be underrepresented in the
Celera assembly as a result of incomplete repeat resolution, as
discussed above. About 8% of the scaffold length is in gaps, and we
expect that much of this is repetitive sequence. Chromosome 19 has
the highest repeat density (57%), as well as the highest gene
density (Table 10). Of interest, among the different classes of
repeat elements, we observe a clear association of Alu elements and
gene density, which was not observed between LINEs and gene
density.

5 Genome Evolution 

Summary. The dynamic nature of genome evolution can be captured at
several levels. These include gene duplications mediated by RNA
intermediates (retrotransposition) and segmental genomic
duplications. In this section, we document the genome-wide
occurrence of retrotransposition events generating functional
(intronless paralogs) or inactive genes (pseudogenes). Genes
involved in translational processes and nuclear regulation account
for nearly 50% of all intronless paralogs and processed pseudogenes
detected in our survey. We have also cataloged the extent of
segmental genomic duplication and provide evidence for 1077
duplicated blocks covering 3522 distinct genes.

Fig. 11 (continued). Relation among gene density (orange), G+C
content (green), EST density (blue), and Alu density (pink) along
the lengths of each of the chromosomes. Gene density was calculated
in 1-Mbp windows. The percent of G+C nucleotides was calculated in
100-kbp windows. The number of ESTs and Alu elements is shown per
100-kbp window.



5.1 Retrotransposition in the human
genome

Retrotransposition of processed mRNA transcripts into the genome
results in functional genes, called intronless paralogs, or
inactivated genes (pseudogenes). A paralog refers to a gene that
appears in more than one copy in a given organism as a result of a
duplication event. The existence of both intron-containing and
intronless forms of genes encoding functionally similar or
identical proteins has been previously described (84, 85).
Cataloging these evolutionary events on the genomic landscape is of
value in understanding the functional consequences of such
gene-duplication events in cellular biology. Identification of
conserved intronless paralogs in the mouse or other mammalian
genomes should provide the basis for capturing the evolutionary
chronology of these transposition events and provide insights into
gene loss and accretion in the mammalian radiation.

A set of proteins corresponding to all 901 Otto-predicted,
single-exon genes was subjected to BLAST analysis against the
proteins encoded by the remaining multiexon predicted transcripts.
Using homology criteria of 70% sequence identity over 90% of the
length, we identified 298 instances of single- to multi-exon
correspondence. Of these 298 sequences, 97 were represented in the
GenBank data set of experimentally validated full-length genes at
the stringency specified and were verified by manual inspection.

We believe that these 97 cases may represent intronless paralogs
(see Web table 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct repeat sequences,
although the precise nature of these repeats remains to be
determined. All of the cases for which we have high confidence
contain polyadenylated [poly(A)] tails characteristic of
retrotransposition.
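The homology criteria above (70% identity over 90% of the length) amount to a two-condition filter on pairwise hits. The sketch below assumes hits have already been parsed into simple records; the `Hit` fields and function names are hypothetical, and measuring coverage against the length of the single-exon query protein is an assumption, since the text does not say which sequence's length is meant.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str            # protein of a single-exon gene
    subject_id: str          # protein of a multiexon transcript
    percent_identity: float  # identity over the aligned region, in percent
    aligned_length: int      # aligned residues of the query
    query_length: int        # full length of the query protein


def passes(hit, min_identity=70.0, min_coverage=0.90):
    """True if the hit meets both the identity and the coverage cutoff."""
    coverage = hit.aligned_length / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def single_to_multi(hits):
    """Single-exon genes with at least one qualifying multiexon match."""
    return sorted({h.query_id for h in hits if passes(h)})


if __name__ == "__main__":
    toy_hits = [
        Hit("g1", "t1", 85.0, 95, 100),   # passes both cutoffs
        Hit("g2", "t2", 65.0, 100, 100),  # identity too low
        Hit("g3", "t3", 90.0, 50, 100),   # coverage too low
    ]
    print(single_to_multi(toy_hits))
```

Only `g1` survives in the toy data. The pseudogene search described later uses the same form of filter with a 70% coverage cutoff.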

Recent publications describing the phenomenon of functional
intronless paralogs speculate that retrotransposition may serve as
a mechanism used to escape X-chromosomal inactivation (84, 86). We
do not find a bias toward X chromosome origination of these
retrotransposed genes; rather, the results show a random chromosome
distribution of both the intron-containing and corresponding
intronless paralogs. We also have found several cases of
retrotransposition from a single source chromosome to multiple
target chromosomes. Interesting examples include the
retrotransposition of a five exon-containing ribosomal protein L21
gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and 14. The
size of the source genes can also show variability. The largest
example is the 31-exon diacylglycerol kinase zeta gene on
chromosome 11 that has an intronless paralog on chromosome 13.
Regardless of route, retrotransposition with subsequent gene
changes in coding or noncoding regions that lead to different
functions or expression patterns represents a key route to
providing an enhanced functional repertoire in mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains
a clear overrepresentation of genes involved in translational
processes (40% ribosomal proteins and 10% translation elongation
factors) and nuclear regulation (HMG nonhistone proteins, 4%), as
well as metabolic and regulatory enzymes. EST matches specific to a
subset of intronless paralogs suggest expression of these
intronless paralogs. Differences in the upstream regulatory
sequences between the source genes and their intronless paralogs
could account for differences in tissue-specific gene expression.
Defining which, if any, of these processed genes are functionally
expressed and translated will require further elucidation and
experimental validation.

Table 11. Genome overview.

Size of the genome (including gaps)                              2.91 Gbp
Size of the genome (excluding gaps)                              2.66 Gbp
Longest contig                                                   1.99 Mbp
Longest scaffold                                                 14.4 Mbp
Percent of A+T in the genome                                     54
Percent of G+C in the genome                                     38
Percent of undetermined bases in the genome                      [...]
Most GC-rich 50 kb                                               Chr. 2 (66%)
Least GC-rich 50 kb                                              Chr. X (25%)
Percent of genome classified as repeats                          35
Number of annotated genes                                        26,383
Percent of annotated genes with unknown function                 42
Number of genes (hypothetical and annotated)                     39,114
Percent of hypothetical and annotated genes with unknown
  function                                                       59
Gene with the most exons                                         Titin (234 exons)
Average gene size                                                27 kbp
Most gene-rich chromosome                                        Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                      Chr. 13 (5 genes/Mb),
                                                                 Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)     605 Mbp
Percent of base pairs spanned by genes                           25.5 to 37.8*
Percent of base pairs spanned by exons                           1.1 to 1.4*
Percent of base pairs spanned by introns                         24.4 to 36.4*
Percent of base pairs in intergenic DNA                          74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons     Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons      Chr. Y (0.36)
[...] (annotated + hypothetical genes)                           Chr. 13 (3,038,416 bp)
[...]                                                            1/1250 bp

*The two values in each range correspond to the annotated gene set
and the annotated + hypothetical gene set.

Table 12. Rate of recombination (cM per Mbp) in 3-Mbp windows, by
chromosome.

                Male              Sex-average            Female
Chrom.    Max.  Avg.  Min.     Max.  Avg.  Min.     Max.  Avg.  Min.
1         2.60  1.12  0.23     2.81  1.42  0.52     3.39  1.76  0.68
2         2.23  0.78  0.33     2.65  1.12  0.54     3.17  1.40  0.61
3         2.55  0.86  0.23     2.40  1.07  0.42     2.71  1.30  0.33
4         1.66  0.67  0.15     2.06  1.04  0.60     2.50  1.40  0.77
5         2.00  0.67  0.18     1.87  1.08  0.42     2.26  1.43  0.62
6         1.97  0.71  0.28     2.57  1.12  0.37     3.47  1.67  0.64
7         2.34  1.16  0.48     1.67  1.17  0.47     2.27  1.21  0.34
8         1.83  0.73  0.14     2.40  1.05  0.46     3.44  1.36  0.43
9         2.01  0.99  0.53     1.95  1.32  0.77     2.63  1.66  0.82
10        3.73  1.03  0.22     3.05  1.29  0.66     2.84  1.51  0.76
11        1.43  0.72  0.31     2.13  0.99  0.47     3.10  1.32  0.49
12        4.12  0.76  0.26     3.35  1.16  0.49     2.93  1.55  0.59
13        1.60  0.75  0.01     1.87  0.95  0.17     2.49  1.19  0.32
14        3.15  0.98  0.18     2.65  1.30  0.62     3.14  1.63  0.75
15        2.28  0.94  0.34     2.31  1.22  0.42     2.53  1.56  0.54
16        1.83  1.00  0.47     2.70  1.55  0.63     4.99  2.32  1.12
17        3.87  0.87  0.00     3.54  1.35  0.54     4.19  1.83  0.94
18        3.12  1.37  0.86     3.75  1.66  0.43     4.35  2.24  0.72
19        3.02  0.97  0.10     2.57  1.41  0.49     2.89  1.75  0.87
20        3.64  0.89  0.00     2.79  1.50  0.83     3.31  2.15  1.34
21        3.23  1.26  0.69     2.37  1.62  1.08     ?.58  1.30  1.18
22        1.25  1.10  0.84     1.88  1.41  1.08     3.73  2.08  0.93
X         NA    NA    NA       NA    NA    NA       3.12  1.64  0.72
Y         NA    NA    NA       NA    NA    NA       NA    NA    NA
Genome    4.12  0.88  0.00     3.75  1.22  0.17     4.99  1.55  0.32

5.2 Pseudogenes

... that account for gene inactivation. The general structural
characteristics of these processed pseudogenes include the complete
lack of intervening sequences found in the functional counterparts,
a poly(A) tract at the 3' end, and direct repeats flanking the
pseudogene sequence. Processed pseudogenes occur as a result of
retrotransposition, whereas unprocessed pseudogenes arise from
segmental genome duplication.

We searched the complete set of Otto-predicted transcripts against
the genomic sequence by means of BLAST. Genomic regions
corresponding to all Otto-predicted transcripts were excluded from
this analysis. We identified 2909 regions matching with greater
than 70% identity over at least 70% of the length of the
transcripts that likely represent processed pseudogenes. This
number is probably an underestimate because specific methods to
search for pseudogenes were not used.

We looked for correlations between structural elements and the
propensity for retrotransposition in the human genome. GC content
and transcript length were compared between the genes with
processed pseudogenes (1177 source genes) versus the remainder of
the predicted gene set. Transcripts that give rise to processed
pseudogenes have shorter average transcript length (1027 bp versus
1594 bp for the Otto set) as compared with genes for which no
pseudogene was detected. The overall GC content did not show any
significant difference, contrary to a recent report (88). There is
a clear trend in gene families that are present as processed
pseudogenes. These include ribosomal proteins (67%), lamin
receptors (10%), translation elongation factor alpha (5%), and
HMG-non-histone proteins (2%). The increased occurrence of
retrotransposition (both intronless paralogs and processed
pseudogenes) among genes involved in translation and nuclear
regulation may reflect an increased transcriptional activity of
these genes.

5.3 Gene duplication in the human" 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
.human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the 
whole genome (2.9-Cbp sequence length) by means of two different methods. Method 1 uses a CG 
likelihood ratio of &0.6. Method 2 uses a CG likelihood ratio of 5:0.8. . . 



.:. Chromosome 22 



Whole genome 
(CS assembly) 





Method 1 Method 2 


Method 1 


. Method 2 


Number of CpG islands 


5,211 


522 


195.706 


26,876 


detected 








Average length of island (bp) 


390 


535 


395 


497 


Percent of sequence 


5.9 


0.8 


2.6 


0.4 


predicted as CpG 










Percent of first exons that 


44 


25 


42 


22 


overlap a CpG island 










Percent of first exons with 


37 


22 


40 


21 


first position of exon 










contained inside a CpG 










island 










Average distance between 


1,013 


10.486 


2,182 


17,021 


first exon and closest CpG 








island (bp) 










Expected distance between 


3,262 


32,567 


7,164 


55,811 


first exon and closest CpG 






island (bp) 










Tabte 14. Distribution of repetitive DNA in the 


compartmentalized shotgun assembly sequence.^ . 






Megabases in . 


. Percent 


Previously 


Repetitive elements 




assembled 


of 


predicted 






sequences 


assembly 


(%) (83) 


Alu 




288 


9.9 


10.0 


'Mammalian Interspersed repeat (MIR) 




66 


2.3 


1.7 


Medium reiteration (MER) 




50 


1.7 


1.6 


Long terminal repeat (LTR) 




155 


5.3 


5.6 


Long Interspersed nucleotide element 




466 


16.1 


16.7 


(LINE) 










Total 




1025 


35.3 


35.6 



result from the Lek clustering provide one basis for comparing the role of whole-genome or chromosomal duplication in protein family expansion as opposed to other means, such as tandem duplication. Because each complete cluster represents a closed and certain island of homology, and because Lek is capable of simultaneously clustering protein complements of several organisms, the number of proteins contributed by each organism to a complete cluster can be predicted with confidence depending on the quality of the annotation of each genome. The variance of each organism's contribution to each cluster can then be calculated, allowing an assessment of the relative importance of large-scale duplication versus smaller-scale, organism-specific expansion and contraction of protein families, presumably as a result of natural selection operating on individual protein families within an organism. As can be seen in Fig. 12, the large variance in the relative numbers of human as compared with D. melanogaster and Caenorhabditis elegans proteins in complete clusters may be explained by multiple events of relative expansions in gene families in each of the three animal genomes. Such expansions would give rise to the distribution that shows a peak at 1:1 in the ratio for human-worm or human-fly clusters with the slope spread covering both human and fly/worm predominance, as we observed (Fig. 12). Furthermore, there are nearly as many clusters where worm and fly proteins predominate despite the larger numbers of proteins in the human. At face value, this analysis suggests that natural selection acting on individual protein families has been a major force driving the expansion of at least some elements of the human protein set. However, in our analysis, the difference between an ancient whole-genome duplication followed by loss, versus piecemeal duplication, cannot be easily distinguished. In order to differentiate these scenarios, more extended analyses were performed.
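
As an illustration of the variance calculation described above, the following minimal Python sketch computes a log2 ratio of human to fly (or worm) protein counts for each complete cluster, and then the variance of those ratios. The cluster counts are hypothetical values chosen for the example, not data from this analysis, and the function and variable names are assumptions introduced here.

```python
import math
from statistics import pvariance


def log2_ratios(clusters):
    """Return log2(human / other) for each (human, other) count pair.

    A value of 0 corresponds to the 1:1 peak; positive values are
    human-predominant clusters, negative values fly/worm-predominant.
    """
    return [math.log2(h / o) for h, o in clusters]


# Hypothetical complete clusters as (human count, fly count) pairs.
example_clusters = [(1, 1), (4, 1), (1, 4), (2, 2)]
ratios = log2_ratios(example_clusters)
spread = pvariance(ratios)  # population variance of the log ratios
```

A symmetric spread around zero, as in this toy input, is the pattern the text attributes to independent family expansions in each lineage.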

5.4 Large-scale duplications 

Using two independent methods, we searched for large-scale duplications in the human genome. First, we describe a protein family-based method that identified highly conserved blocks of duplication. We then describe our comprehensive method for identifying all interchromosomal block duplications. The latter method identified a large number of duplicated chromosomal segments covering parts of all 24 chromosomes.

The first of the methods is based on the idea of searching for blocks of highly conserved homologous proteins that occur in more than one location on the genome. For this comparison, two genes were considered equivalent if their protein products were determined to be in the same family and the same complete Lek cluster (essentially paralogous genes) (89). Initially, each chromosome was represented as a string of genes ordered by the start codons for predicted genes along the chromosome. We considered the two strands as a single string, because local inversions are relatively common events relative to large-scale duplications. Each gene was indexed according to the protein family and Lek complete cluster (89). All pairs of indexed gene strings were then aligned in both the forward and reverse directions with the Smith-Waterman algorithm (90). A match between two proteins of the same Lek complete cluster was given a score of 10 and a mismatch -10, with gap open and extend penalties of -4 and -1. With these parameters, 19 conserved interchromosomal blocks of duplication were observed, all of which were also detected and expanded by the comprehensive method described below. The detection of only a relatively small number of block duplications was a consequence of using an intrinsically conservative method grounded in the conservative constraints of the complete Lek clusters.
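
The gene-string alignment described above can be sketched in Python as a Smith-Waterman local alignment with affine gaps, using the stated scores (match 10, mismatch -10, gap open -4, gap extend -1). This is a minimal sketch under several assumptions that are not taken from the original implementation: each gene is represented by an integer cluster identifier, a gap of length k costs the open penalty once plus the extend penalty for each further position, and the example gene strings are hypothetical.

```python
def smith_waterman_affine(a, b, match=10, mismatch=-10,
                          gap_open=-4, gap_extend=-1):
    """Best local alignment score between two sequences of cluster IDs."""
    n, m = len(a), len(b)
    neg = float("-inf")
    # h: best score of an alignment ending at (i, j)
    # e: best score ending with a gap that consumes b[j-1]
    # f: best score ending with a gap that consumes a[i-1]
    h = [[0] * (m + 1) for _ in range(n + 1)]
    e = [[neg] * (m + 1) for _ in range(n + 1)]
    f = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = max(h[i][j - 1] + gap_open, e[i][j - 1] + gap_extend)
            f[i][j] = max(h[i - 1][j] + gap_open, f[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


def best_either_orientation(a, b):
    """Align b in forward and reverse order and keep the higher score."""
    return max(smith_waterman_affine(a, b),
               smith_waterman_affine(a, b[::-1]))


# Hypothetical gene strings: each integer stands for a complete Lek cluster.
chrom_a = [7, 3, 9, 12, 5]
chrom_b = [1, 3, 9, 4, 12, 8]
score = best_either_orientation(chrom_a, chrom_b)
```

In the toy input, clusters 3, 9, and 12 occur in the same order on both strings with one intervening gene, so the best alignment is three matches and one opened gap.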
In the second, more comprehensive approach, we aligned all chromosomes directly with one another using an algorithm based on the MUMmer system (91). This alignment method uses a suffix tree data structure and a linear-time algorithm to align long sequences very rapidly; for example, two chromosomes of 100 Mbp can be aligned in less than 20 min (on a Compaq Alpha computer) with 4 gigabytes of memory. This procedure was used recently to identify numerous large-scale segmental duplications among the five chromosomes of A. thaliana (92); in that organism, the method revealed that 60% of the genome (66 Mbp) is covered by 24 very large duplicated segments. For Arabidopsis, a DNA-based alignment was sufficient to reveal the segmental duplications between chromosomes; in the human genome, DNA alignments at the whole-chromosome level are insufficiently sensitive. Therefore, a modified procedure was developed and applied. The 26,588 proteins (¥,675,713 million amino acids) were concatenated end-to-end in order as they occur along each of the 24 chromosomes, irrespective of strand location. The concatenated protein set was then aligned against each chromosome by the MUMmer algorithm. The resulting matches were clustered to extract all sets of three or more protein matches that occur in close proximity on two different chromosomes (93); these represent the candidate segmental duplications. A series of filters were developed and applied to remove likely false-positives from this set; for example, small blocks that were spread across many proteins were removed.

To refine the filtering methods, a shuffled protein set was first created by taking the 26,588 proteins, randomizing their order, and then partitioning them into 24 shuffled chromosomes, each containing the same number of proteins as the true genome. This shuffled protein set has the identical composition to the real genome; in particular, every protein and every domain appears the same number of times. The complete algorithm was then applied to both the real and the shuffled data, with the results on the shuffled data being used to estimate the false-positive rate. The algorithm after filtering yielded 10,310 gene pairs in 1077 duplicated blocks containing 3522 distinct genes; tandemly duplicated expansions in many of the blocks explain the excess of gene pairs to distinct genes. In the shuffled data, by contrast, only 370 gene pairs were found, giving a false-positive estimate of 3.6%. The most likely explanation for the 1077 block duplications is ancient segmental duplications. In many cases, the order of the proteins has been shuffled, although proximity is preserved. Out of the 1077 blocks, 159 contain only three genes, 137 contain four genes, and 781 contain five or more genes.
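
The shuffled-genome control and the arithmetic behind the reported figures can be sketched as follows. This is an illustrative Python sketch, not the original pipeline: the function names, the fixed random seed, and the tiny three-chromosome example are assumptions introduced here, while the counts 370, 10,310, 159, 137, and 781 are taken from the text.

```python
import random


def shuffle_into_chromosomes(proteins_per_chrom, seed=0):
    """Randomize protein order, then re-cut into chromosomes of the original sizes."""
    pool = [p for chrom in proteins_per_chrom for p in chrom]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(pool)
    shuffled, start = [], 0
    for chrom in proteins_per_chrom:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


def false_positive_rate(pairs_in_shuffled, pairs_in_real):
    """Fraction of gene pairs expected by chance, estimated from the shuffled set."""
    return pairs_in_shuffled / pairs_in_real


# Hypothetical toy genome with three chromosomes of different sizes.
toy_genome = [["p1", "p2", "p3"], ["p4", "p5"], ["p6", "p7", "p8", "p9"]]
toy_shuffled = shuffle_into_chromosomes(toy_genome)

# Figures reported in the text.
fp_rate = false_positive_rate(370, 10310)
blocks_total = 159 + 137 + 781
```

Because the shuffled set keeps every protein and every chromosome size, any blocks it yields can only arise by chance, which is what makes the ratio a false-positive estimate.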
To illustrate the extent of the detected duplications, Fig. 13 shows all 1077 block duplications indexed to each chromosome in 24 panels in which only duplications mapped to the indexed chromosome are displayed. The figure makes it clear that the duplications are ubiquitous in the genome. One feature that it displays is many relatively small chromosomal stretches with one-to-many duplication relationships that are graphically striking. One such example captured by the analysis is the well-documented olfactory receptor (OR) family, which is scattered in blocks throughout the genome and which has been analyzed for genome-deployment reconstructions at several evolutionary stages (94). The figure also illustrates that some chromosomes, such as chromosome 2, contain many more detected large-scale duplications than others. Indeed, one of the largest duplicated segments is a large block of 33 proteins on chromosome 2, spread among eight smaller blocks in 2p, that aligns to a paralogous set on chromosome 14, with one rearrangement (see chromosomes 2 and 14 panels in Fig. 13). The proteins are not contiguous but span a region containing [...] proteins on chromosome 2 and 332 proteins on chromosome 14. The likelihood of observing this many duplicated proteins by chance, even over a span of this length, is 2.3 × 10^-6 (95). This duplicated set spans 20 Mbp on chromosome 2 and 63 Mbp on chromosome 14, over 70% of the latter chromosome. Chromosome 2 also contains a block duplication that is nearly as large, which is shared by chromosome arm 2q and chromosome 12. This duplication incorporates two of the four known Hox gene clusters, but considerably expands the extent of the duplications proximally and distally on the pair of chromosome arms. This breadth of duplication is also seen on the two chromosomes carrying the other two Hox clusters.

An additional large duplication, between chromosomes 18 and 20, serves as a good example to illustrate some of the features common to many of the other observed large duplications (Fig. 13, inset). This duplication contains 64 detected ordered intrachromosomal pairs of homologous genes. After discounting a 40-Mb stretch of chromosome 18 free of matches to chromosome 20, which is likely to represent a large insert (between the gene assignments "Krup rel" and "collagen rel" on chromosome 18 in Fig. 13), the full duplication segment covers 36 Mb on chromosome 18 and 28 Mb on chromosome 20.



[Fig. 12. Legend: human/worm, human/fly. x-axis: ratio, from human predominant (5:1, 4:1, 3:1) to fly/worm predominant (1:3, 1:4, 1:5). Surviving caption text: "... number) of human versus worm and human versus fly proteins per cluster were plotted ..."]
By this measure, the duplication segment spans nearly half of each chromosome's net length. The most likely scenario is that the whole span of this region was duplicated as a single very large block, followed by shuffling owing to smaller scale rearrangements. As such, at least four subsequent rearrangements would need to be invoked to explain the relative insertions and inversions seen in the duplicated segment interval. The 64 protein pairs in this alignment occur among 217 protein assignments on chromosome 18, and among 322 protein assignments on chromosome 20, for a density of involved proteins of 20 to 30%. This is consistent with an ancient large-scale duplication followed by subsequent gene loss on one or both chromosomes. Loss of just one member of a gene pair subsequent to the duplication would result in a failure to score a gene pair in the block; less than 50% gene loss on the chromosomes would lead to the duplication density observed here. As an independent verification of the significance of the alignments detected, it can be seen that a substantial number of the pairs of aligning proteins in this duplication, including some of those annotated (Fig. 13), are those populating small Lek complete clusters (see above). This indicates that they are members of very small families of paralogs; their relative scarcity within the genome validates the uniqueness and robust nature of their alignments.

Two additional qualitative features were observed among many of the large-scale duplications. First, several proteins with disease associations, with OMIM (Online Mendelian Inheritance in Man) assignments, are members of duplicated segments (see web table 2 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also observed a few instances where paralogs on both duplicated segments are associated with similar disease conditions. Notable among these genes are proteins involved in hemostasis (coagulation factors) that are associated with bleeding disorders, transcriptional regulators like the homeobox proteins associated with developmental disorders, and potassium channels associated with cardiovascular conduction abnormalities. For each of these disease genes, closer study of the paralogous genes in the duplicated segment may reveal new insights into disease causation, with further investigation needed to determine whether they might be involved in the same or similar genetic diseases. Second, although there is a conserved number of proteins and coding exons predicted for specific large duplicated spans within the chromosome 18 to 20 alignment, the genomic DNA of chromosome 18 in these specific spans is in some cases more than 10-fold longer than the corresponding chromosome 20 DNA. This selective accretion of noncoding DNA (or conversely, loss of noncoding DNA) on one of a pair of duplicated chromosome regions was observed in many compared regions. Hypotheses to explain which mechanisms foster these processes must be tested.

Evaluation of the alignment results gives some perspective on dating of the duplications. As noted above, large-scale ancient segmental duplication in fact explains many of the blocks [...]. The regions of human chromosomes involved in the large-scale duplications expanded upon above (chromosomes 2 to 14, 2 to 12, and 18 to 20) are each syntenic to a distinct mouse chromosomal region. The corresponding mouse chromosomal regions are much more similar in sequence conservation, and even in order, to their human synteny partners than the human duplication regions are to each other. Further, the corresponding mouse chromosomal regions each bear a significant proportion of genes orthologous to the human genes on which the human duplication assignments were made. On the basis of these factors, the corresponding mouse chromosomal spans, at coarse resolution, appear to be products of the same large-scale duplications observed in humans. Although further detailed analysis must be carried out once a more complete genome is assembled for mouse, the underlying large duplications appear to predate the two species' divergence. This dates the duplications, at the latest, before divergence of the primate and rodent lineages. This date can be further refined upon examination of the synteny between human chromosomes and those of chicken, pufferfish (Fugu rubripes), or zebrafish (95). The only substantial syntenic stretches mapped in these species corresponding to both pairs of human duplications are restricted to the Hox cluster regions. When the synteny of these regions (or others) to human chromosomes is extended with further mapping, the ages of the nearly chromosome-length duplications seen in humans are likely to be dated to the root of vertebrate divergence.

The MUMmer-based results demonstrate large block duplications that range in size from a few genes to segments covering most of a chromosome. The extent of segmental duplications raises the question of whether an ancient whole-genome duplication event is the underlying explanation for the numerous duplicated regions (96). The duplications have undergone many deletions and subsequent rearrangements; these events make it difficult to distinguish between a whole-genome duplication and multiple smaller events. Further analysis, focused especially on comparing the estimated ages of all the block duplications, derived partially from interspecies genome comparisons, will be necessary to determine which of these two hypotheses is more likely. Comparisons of genomes of different vertebrates, and even cross-phyla genome comparisons, will allow for the deconvolution of duplications to eventually reveal the stagewise history of our genome, and with it a history of the emergence of many of the key functions that distinguish us from other living things.

6 A Genome-Wide Examination of Sequence Variations

Summary. Computational methods were used to identify single-nucleotide polymorphisms (SNPs) by comparison of the Celera sequence to other SNP resources. The SNP rate between two chromosomes was ~1 per 1200 to 1500 bp. SNPs are distributed nonrandomly throughout the genome. Only a very small proportion of all SNPs (<1%) potentially impact protein function based on the functional analysis of SNPs that affect the predicted coding regions. This results in an estimate that only thousands, not millions, of genetic variations may contribute to the structural diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a dramatic acceleration in the rate of gene discovery, but only through analysis of sequence variation in DNA can we discover the genetic basis for variation in health among human beings. Whole-genome shotgun sequencing is a particularly effective method for detecting sequence variation in tandem with whole-genome assembly. In addition, we compared the distribution and attributes of SNPs ascertained by three other methods: (i) alignment of the Celera consensus sequence to the PFP assembly, (ii) overlap of high-quality reads of genomic sequence (referred to as "Kwok"; 1,120,195 SNPs) (97), and (iii) reduced representation shotgun sequencing (referred to as "TSC"; 632,640 SNPs) (98). These data were consistent in showing an overall nucleotide diversity of ~8 × 10^-4, marked heterogeneity across the genome in SNP density, and an overwhelming preponderance of noncoding variation that produces no change in expressed proteins.

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full use of sequence depth and quality at every site, and quantitatively control the rate of false-positive and false-negative calls with an explicit sampling model (99). Comparison of consensus sequences in the absence of these details necessitated a more ad hoc approach (quality scores could not readily be obtained for the PFP assembly). First, all sequence differences between the two consensus sequences were identified; these were then filtered to reduce the contribution of sequencing errors and misassembly. As a measure of the effectiveness of the filtering step, we monitored the ratio of transition and transversion substitutions, because a 2:1 ratio has been well documented as typical in mammalian evolution (100) and in human SNPs (101, 102). The filtering steps consisted of [...] where the [...] score [...] less than 30 and where [...] variants was greater than 5 [...]. [...] filters resulted in [...] transition [...] between the Celera and PFP consensus [...] SNPs, for a total of [...] differences. Overlap between this set of SNPs and those found by other methods is described below.

[...] (www.ncbi.nlm.nih.gov/SNP) and [...] HGMD (Human Gene Mutation Database [...] from the University of [...]) were mapped on the Celera consensus sequence by a sequence similarity search with the program PowerBlast (103). The two largest data sets in dbSNP are the Kwok and TSC sets, with 47% and 25% of the dbSNP records. Low-quality alignments with partial [...] between the Celera sequence and the dbSNP flanking sequence were eliminated. dbSNP [...] were discarded. A total of [...] SNP variants were mapped to 1,223,038 unique locations on the Celera sequence, implying considerable redundancy in [...]. SNPs in the TSC set mapped to 585,[?]11 unique genomic locations, and SNPs in the Kwok set mapped to 438,032 unique [...]. [...] in this analysis, including Celera-PFP, TSC, and Kwok, is 2,737,668. Table 15 shows that a substantial fraction of SNPs [...] these methods was also found by another method. The very high overlap [...]. The low overlap (16.4%) between the Kwok and TSC sets is due [...].

[Table 15 caption, partly illegible: "... SNP counts for each pair of data sets ... Numbers in parentheses are the fraction ..."]
[...] Celera-PFP SNPs overlap with SNPs derived from the Celera genome sequences [...]. [...] population [...] is an expensive and laborious process, [...] on multiple data sets may provide an efficient initial validation "in silico" (by computational analysis).

One means of assessing whether the [...] frequencies of the six possible base changes [...] analysis on candidate [...] all three data sets validates the previous observations at the whole-genome scale. [...] the SNPs found in the Kwok set, the TSC set, and in our whole-genome shotgun (46) in this substitution pattern. Compared with the rest of the data sets, Celera-PFP [...] ratio observed in the other SNP sets. This result is not unexpected, because some fraction of the computationally identified SNPs in the Celera-PFP comparison may in fact be sequence errors. [...] transition:transversion ratio for the bona fide SNPs would be obtained if one assumed that 15% of the sequence differences in the Celera-PFP set were a result of (presumably random) sequence errors.
6.3 Estimation of nucleotide diversity from ascertained SNPs

The number of SNPs identified varied widely across chromosomes. In order to normalize these values to the chromosome [...] sequence coverage, we used π, the standard statistic for nucleotide diversity [...]. Nucleotide diversity is a measure of per-site heterozygosity, quantifying the probability that a pair of chromosomes drawn from the population will differ at a nucleotide site. In order to calculate nucleotide diversity for each chromosome, we need to know the number of nucleotide sites that were surveyed for variation, and in methods like reduced representation sequencing, we need to know the sequence quality and the depth of coverage at each [...]. [...] not readily available, so we could not estimate nucleotide diversity from the TSC effort. Estimation of nucleotide diversity from high-quality sequence [...] probability of detecting a SNP [...] the probability of correct sequence calls: the higher the sequence quality, the higher is the chance of successfully detecting a SNP (105). Even after correcting for variation in coverage, the [...] heterogeneity was tested by analysis of variance, with estimates of π for 100-kbp windows to [...]. [...] diversity for autosomes estimated from the Celera-PFP comparison [...] × 10^-4. Nucleotide [...] the X chromosome was 6.54 × 10^-4. The X is expected to be less variable than autosomes, because for every four copies of autosomes in the population, there are only three X chromosomes, and this smaller effective population size means that random [...].

Having ascertained nucleotide variation genome-wide, it appears that previous estimates of nucleotide diversity in humans based on samples of genes were reasonably accurate (101, 102, 106, 107). Genome-wide, [...] for the Celera-PFP alignment, and a published estimate averaged over 10 densely resequenced human genes was 8.00 × 10^-4 (108).
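
As a worked illustration of the statistic used in this section, the Python sketch below computes nucleotide diversity as the average per-site difference over all pairs of aligned sequences, and the neutral expectation for the X chromosome when its effective population size is three-quarters that of the autosomes. The toy sequences and function names are hypothetical. The autosomal value of 8 × 10^-4 is the approximate genome-wide figure discussed in this section and is used here only as an assumed input, and equal mutation rates on the X and the autosomes are also assumed.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of differences per site over all pairs of aligned sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


def expected_x_diversity(pi_autosome):
    """Neutral expectation if the X has 3/4 of the autosomal effective population size."""
    return 0.75 * pi_autosome


# Hypothetical aligned sequences, four sites each.
toy_pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])

expected_x = expected_x_diversity(8e-4)  # assumed autosomal diversity
observed_ratio = 6.54e-4 / 8e-4          # reported X value over assumed autosomal value
```

With these inputs the observed X-to-autosome ratio comes out somewhat above the simple three-quarters expectation, so the sketch should be read as a consistency check on direction rather than as a test of the model.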



Table 16. Summary of nucleotide changes in different SNP data sets. [Table body not recoverable; surviving entries: Celera-PFP; TSC; 188,694 (0.322).]

Fig. 13. Segmental duplications between chromosomes in the human genome. The 24 panels show the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line represents a pair of homologous genes belonging to a block; all blocks contain at least three genes on each of the chromosomes where they appear. Each panel shows all the duplications between a single chromosome and other chromosomes with shared blocks. The chromosome at the center of each panel is shown as a thick red line for emphasis. Other chromosomes are displayed from top to bottom within each panel ordered by chromosome number. The inset (bottom, center right) shows a close-up of one duplication between chromosomes 18 and 20, expanded to display the gene names of 12 of the 64 gene pairs shown.

6.4 Variation in nucleotide diversity across the human genome

Such an apparently high degree of variability among chromosomes in SNP density raises the question of whether there is heterogeneity at a finer scale within chromosomes, and whether this heterogeneity is greater than expected by chance. If SNPs occur by random and independent mutations, then it would seem that there ought to be a Poisson distribution of numbers of SNPs in fragments of arbitrary constant size. The observed dispersion in the distribution of SNPs in 100-kbp fragments was far greater than predicted from a Poisson distribution (Fig. 14). However, this simplistic model ignores the different recombination rates and population histories that exist in different regions of the genome. Population genetics theory holds that we can account for this variation with a mathematical formulation called the neutral coalescent (109). Applying well-tested algorithms for simulating the neutral coalescent with recombination (110), and using an effective population size of 10,000 and a per-base recombination rate equal to the mutation rate (111), we generated a distribution of numbers of SNPs by this model as well (112). The observed distribution of SNPs has a much larger variance than either the Poisson model or the coalescent model, and the difference is highly significant. This implies that there is significant variability across the genome in SNP density, an observation that begs an explanation.

Several attributes of the DNA sequence may affect the local density of SNPs, including the rate at which DNA polymerase makes errors and the efficacy of mismatch repair. One key factor that is likely to be associated with SNP density is the G+C content, in part because methylated cytosines in CpG dinucleotides tend to undergo deamination to form thymine, accounting for a nearly 10-fold increase in the mutation rate of CpGs over other dinucleotides. We tallied the GC content and nucleotide diversities across the entire genome and found that the correlation between them was positive (r = 0.21) and highly significant (P < 0.0001), but G+C content accounted for only a small part of the variation.

6.5 SNPs by genomic class

To test homogeneity of SNP densities across functional classes, we partitioned sites into intergenic (defined as >5 kbp from any predicted transcription unit), 5'-UTR, exonic (missense and silent), intronic, and 3'-UTR for 10,239 known genes, derived from the NCBI RefSeq database and all human genes predicted from the Celera Otto annotation. In coding regions, SNPs were categorized as either silent, for those that do not change amino acid sequence, or missense, for those that change the protein product. The ratio of missense to silent coding SNPs in Celera-PFP, TSC, and Kwok sets (1.12, 0.91, and 0.78, respectively) shows a markedly reduced frequency of missense variants compared with the neutral expectation, consistent with the elimination by natural selection of a fraction of the deleterious amino acid changes (112). These ratios are comparable to the missense-to-silent ratios of 0.88 and 1.17 found by Cargill et al. (101) and by Halushka et al. (102). Similar results were observed in SNPs derived from Celera shotgun sequences (46).

It is striking how small is the fraction of SNPs that lead to potentially dysfunctional alterations in proteins. In the 10,239 RefSeq genes, missense SNPs were only about [...] the total SNP [...] and Kwok SNPs, respectively. Nonconservative protein changes constitute an even smaller fraction of missense SNPs (47, 41, and 40% in Celera-PFP, Kwok, and TSC). Intergenic regions have been virtually unstudied [...], and we note [...] were intergenic (Table 17). The SNP rate was highest in introns and lowest in exons. The SNP rate was lower in intergenic regions than in introns, providing one of the first discriminators between these two classes of DNA. These SNP rates were confirmed in the Celera SNPs, which also exhibited a lower rate in exons than in introns, and in extragenic regions than in introns (46). Many of these intergenic SNPs will [...] in the form of markers for linkage and association studies, and some fraction is likely to have a regulatory function as well.

7 An Overview of the Predicted Protein-Coding Genes in the Human Genome

Summary. This section provides an initial computational analysis of the predicted protein set with the aim of cataloging prominent differences and similarities when the human genome is compared with other fully sequenced eukaryotic genomes. Over 40% of the predicted protein set in humans cannot be ascribed a molecular function by methods that assign proteins to known families. A protein domain-based analysis provides a detailed catalog of the prominent differences in the human genome when compared with the fly and worm genomes. Prominent among these are domain expansions in proteins involved in developmental regulation and in cellular processes such as neuronal function, hemostasis, acquired immune response, and cytoskeletal complexity. The final enumeration of protein families and details of protein structure will rely on additional experimental work and comprehensive manual curation.

A preliminary analysis of the predicted human protein-coding genes was conducted. Two methods were used to analyze and classify the molecular functions of 26,588 predicted proteins that represent 26,383 gene predictions with at least two lines of evidence as described above. The first method was based on an analysis at the level of protein families, with both the publicly available Pfam database (114, 115) and Celera's Panther Classification (CPC) (Fig. 15) (116). The second method was based on an analysis at the level of protein domains, with both the Pfam and SMART databases (115, 117).

The results presented here are preliminary and are subject to several limitations.



[Fig. 14 x-axis: number of SNPs per 100 kb.]

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely accounted for by a coalescent model of regional history.
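
The comparison with a Poisson model summarized in Fig. 14 can be illustrated with a simple index of dispersion, the ratio of the variance to the mean of SNP counts per window. For a Poisson process this ratio is expected to be close to 1, and values well above 1 indicate overdispersion. The sketch below is a minimal Python illustration. The window counts are hypothetical, the function name is an assumption, and it does not reproduce the coalescent simulation or the significance test used in the analysis.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the mean; about 1 is expected under a Poisson model."""
    return variance(counts) / mean(counts)


# Hypothetical SNP counts in eight 100-kbp windows.
window_counts = [20, 150, 35, 90, 10, 210, 60, 75]
index = dispersion_index(window_counts)
overdispersed = index > 1.0
```

A formal test would compare the index with its sampling distribution under each model, and the coalescent comparison in particular requires simulation.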
Both the gene predictions and functional assignments have been made by using computational tools, although the statistical models in Panther, Pfam, and SMART have been built, annotated, and reviewed by expert biologists. In the set of computationally predicted genes, we expect both false-positive predictions (some of these may in fact be inactive pseudogenes) and false-negative predictions (some human genes will not be computationally predicted). We also expect errors in delimiting the boundaries of exons and genes. Similarly, in the automatic functional assignments, we also expect both false-positive and false-negative predictions. The functional assignment protocol focuses on protein families that tend to be found across several organisms or on families of known human genes. Therefore, we do not assign a function to many genes that are not in large families, even if the function is known. Unless otherwise specified, all enumeration of the genes in any given family or functional category was taken from the set of 26,588 predicted proteins, which were assigned functions by using statistical score cutoffs defined for models in Panther, Pfam, and SMART.

For this initial examination of the predicted human protein set, three broad questions were asked: (i) What are the likely molecular functions of the predicted gene products, and how are these proteins categorized with current classification methods? (ii) What are the core functions that appear to be common across the animals? (iii) How does the human protein complement differ from that of other sequenced eukaryotes?



7.1 Molecular functions of predicted human proteins

Figure 15 shows an overview of the putative molecular functions of the predicted 26,588 human proteins that have at least two lines of supporting evidence. About 41% (12,809) of the gene products could not be classified from this initial analysis and are termed proteins with unknown functions. Because our automatic classification methods treat only relatively large protein families, there are a number of "unclassified" sequences that do, in fact, have a known or predicted function. For the 60% of the protein set that have automatic functional predictions, the specific protein functions have been placed into broad classes. We focus here on molecular function (rather than higher order cellular processes) in order to classify as many proteins as possible. These functional predictions are based on similarity to sequences of known function.

In our analysis of the 12,731 additional low-confidence predicted genes (those with only one piece of supporting evidence), only 636 (5%) of these additional putative genes were assigned molecular functions by the automated methods. One-third of these 636 predicted genes represented endogenous retroviral proteins, further suggesting that the majority of these unknown-function genes are not real genes. Given that most of these additional 12,095 genes appear to be unique among the genomes sequenced to date, many may simply represent false-positive gene predictions.

The most common molecular functions are the transcription factors and those involved in nucleic acid metabolism (nucleic acid enzyme). [...] the human g[...] and hydrolases. Not surprisingly, most of the hydrolases are proteases. There are also many proteins that are members of proto-oncogene families, as well as families of "select regulatory molecules": (i) proteins involved in specific steps of signal transduction such as heterotrimeric GTP-binding proteins (G proteins) and cell cycle regulators, and (ii) proteins that modulate the activity of kinases, G proteins, and [...]



Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class     Size of region examined (Mb)    Celera-PFP SNP density (SNP/Mb)
Intergenic               2185                            707
Gene (intron + exon)     646                             917
Intron                   615                             921
First intron             164                             808
Exon                     31                              529
First exon               10                              592
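The two numeric columns of Table 17 can be combined to get a feel for how many SNPs each region class contributes. The Python sketch below multiplies the size examined by the SNP density to obtain an approximate SNP count, and expresses each density relative to the intergenic density. The function names are hypothetical, the input values are simply those listed in Table 17, and the products are rough estimates because both columns are rounded.

```python
# (region class, size examined in Mb, SNP density in SNPs per Mb), from Table 17.
TABLE_17 = [
    ("Intergenic", 2185, 707),
    ("Gene (intron + exon)", 646, 917),
    ("Intron", 615, 921),
    ("First intron", 164, 808),
    ("Exon", 31, 529),
    ("First exon", 10, 592),
]


def estimated_snp_counts(rows):
    """Approximate SNP count per region class as size (Mb) x density (SNPs/Mb)."""
    return {name: size_mb * density for name, size_mb, density in rows}


def density_relative_to(rows, reference="Intergenic"):
    """SNP density of each class divided by the density of the reference class."""
    densities = {name: density for name, _, density in rows}
    ref = densities[reference]
    return {name: d / ref for name, d in densities.items()}


counts = estimated_snp_counts(TABLE_17)
relative = density_relative_to(TABLE_17)
for name, _, _ in TABLE_17:
    print(f"{name:22s} ~{counts[name]:>9,d} SNPs  "
          f"{relative[name]:.2f}x intergenic density")
```

Read this way, the table shows exons carrying a lower SNP density than intergenic sequence, while introns carry a higher one.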



[Fig. 15 slice labels, given as category (number, percentage). Outer circle: GO categories. Inner circle: Panther categories.]

cell adhesion (577, 1.9%)
miscellaneous (1318, 4.3%)
viral protein (100, 0.3%)
transfer/carrier protein (203, 0.7%)
transcription factor (values illegible)
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163, 0.5%)
hydrolase (1227, 4.0%)
chaperone (159, 0.5%)
cytoskeletal structural protein (876, 2.8%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
motor (376, 1.2%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)
Fig. 15. Distribution of the molecular functions of 26,383 human genes. Each slice lists the numbers and percentages (in parentheses) of human gene functions assigned to a given category of molecular function. The outer circle shows the assignment to molecular function categories in the Gene Ontology (GO) (179), and the inner circle shows the assignment to Celera's Panther molecular function categories (116).
7.2 Evolutionary conservation of core processes

Because of the various "model organism" genome-sequencing projects that have already been completed, reasonable comparative information is available for beginning the analysis of the evolution of the human genome. The genomes of S. cerevisiae ("bakers' yeast") (118) and two diverse invertebrates, C. elegans (a nematode worm) (119) and D. melanogaster (fly) (26), as well as the first plant genome, A. thaliana, recently completed (92), provide a diverse background for genome comparisons.

We enumerated the "strict orthologs" conserved between human and fly, and between human and worm (Fig. 16) to address the question, What are the core functions that appear to be common across the animals? The concept of orthology is important because if two genes are orthologs, they can be traced by descent to the common ancestor of the two organisms (an "evolutionarily conserved protein set"), and therefore are likely to perform similar conserved functions in the different organisms. It is critical in this analysis to separate orthologs (a gene that appears in two organisms by descent from a common ancestor) from paralogs (a gene that appears in more than one copy in a given organism by a duplication event) because paralogs may subsequently diverge in function. Following the yeast-worm ortholog comparison in (120), we identified two different cases for each pairwise comparison (human-fly and human-worm). The first case was a pair of genes, one from each organism, for which there was no other close homolog in either organism. These are straightforwardly identified as orthologous, because there are no additional members of the families that complicate separating orthologs from paralogs. The second case is a family of genes with more than one member in either or both of the organisms being compared. Chervitz et al. (120) dealt with this case by analyzing a phylogenetic tree that described the relationships between all of the sequences in both organisms, and then looking for pairs of genes that were nearest neighbors in the tree. If the nearest-neighbor pairs were from different organisms, those genes were presumed to be orthologs. We note that these nearest neighbors can often be confidently identified from pairwise sequence comparison without having to examine a phylogenetic tree (see legend to Fig. 16). If the nearest neighbors are not from different organisms, there has been a paralogous expansion in one or both organisms after the speciation event (and/or a gene loss by one organism). When this one-to-one correspondence is lost, defining an ortholog becomes ambiguous. For our initial computational overview of the predicted human protein set, we could not answer this question for every predicted protein. Therefore, we consider only "strict orthologs," i.e., the proteins with unambiguous one-to-one relationships (Fig. 16). By these criteria, there are 2758 strict human-fly orthologs and 2031 human-worm orthologs (1523 in common between these sets). We define the evolutionarily conserved set as those 1523 human proteins that have strict orthologs in both D. melanogaster and C. elegans.
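The one-to-one relationship described above is commonly approximated by reciprocal (bi-directional) best hits between two protein sets. The Python sketch below illustrates that idea under stated assumptions: the input is a list of (query, subject, score, P-value) tuples such as might be parsed from pairwise search output, the function names are hypothetical, and the default cutoff is a placeholder. It checks only reciprocity and does not compare hits against within-species paralogs, so it is a simplification of the "strict ortholog" criteria, not the pipeline used for this analysis.

```python
def best_hits(hits, max_pvalue=1e-10):
    """Map each query to its single highest-scoring subject.

    hits: iterable of (query, subject, score, pvalue) tuples.
    Hits with a P-value above max_pvalue (a placeholder cutoff) are ignored.
    """
    best = {}
    for query, subject, score, pvalue in hits:
        if pvalue > max_pvalue:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_pvalue=1e-10):
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    forward = best_hits(a_vs_b, max_pvalue)
    reverse = best_hits(b_vs_a, max_pvalue)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical toy data: hA/fA are one-to-one; hB loses fB to hA; hC is too weak.
human_vs_fly = [
    ("hA", "fA", 200, 1e-50),
    ("hA", "fB", 150, 1e-30),
    ("hB", "fB", 90, 1e-20),
    ("hC", "fC", 40, 1e-5),
]
fly_vs_human = [
    ("fA", "hA", 198, 1e-50),
    ("fB", "hA", 150, 1e-30),
    ("fB", "hB", 88, 1e-20),
    ("fC", "hC", 41, 1e-5),
]

pairs = reciprocal_best_hits(human_vs_fly, fly_vs_human)
print(sorted(pairs))
```

Given human-fly and human-worm pair sets built this way, the conserved set corresponds to the human identifiers present in both, for example `{a for a, _ in human_fly} & {a for a, _ in human_worm}`.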

The distribution of the functions of the conserved protein set is shown in Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the set of conserved proteins is not distributed among molecular functions in the same way as the whole human protein set. Compared with the whole human set (Fig. 15), there are several categories that are overrepresented in the conserved set by a factor of ~2 or more. The first category is nucleic acid enzymes, primarily the transcriptional machinery (notably DNA/RNA methyltransferases, DNA/RNA polymerases, helicases, DNA ligases, DNA- and RNA-processing factors, nucleases, and ribosomal proteins). The basic transcriptional and translational machinery is well known to have been conserved over evolution, from bacteria through to the most complex eukaryotes. Many ribonucleoproteins involved in RNA splicing also appear to be conserved among the animals. Other enzyme types are also overrepresented (transferases, oxidoreductases, ligases, lyases, and isomerases). Many of these enzymes are involved in intermediary metabolism. The only exception is the hydrolase category, which is not significantly overrepresented in the shared protein set. Proteases form the largest part of this category, and several large protease families have expanded in each of these three organisms after their divergence. The category of select regulatory molecules is also overrepresented in the conserved set. The major conserved families are small guanosine triphosphatases (GTPases) (especially the Ras-related superfamily, including ADP ribosylation factor) and cell cycle regulators (particularly the cullin family, cyclin C family, and several cell division protein kinases). The last two significantly overrepresented categories are protein transport and trafficking, and chaperones. The most conserved groups in these categories are proteins involved in coated vesicle-mediated transport, and chaperones involved in protein folding and heat-shock response [particularly the [...] family, and heat-shock protein 60 (HSP60), HSP70, and HSP90 families]. These observations provide only a conservative estimate of the protein families in the context of specific cellular processes that were likely derived from the last common ancestor of the human, fly, and worm. As stated before, this analysis does not provide a complete estimate of conservation across the three animal genomes, as paralogous duplication makes the determination of true orthologs difficult within the members of conserved protein families.

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate genomes. Each slice lists the number and percentages (in parentheses) of "strict orthologs" between the human, fly, and worm genomes involved in a given category of molecular function. "Strict orthologs" are defined here as bi-directional BLAST best hits (180) such that each orthologous pair (i) has a BLASTP P-value of ≤10^-10 (120), and (ii) has a more significant BLASTP score than any paralogs in either organism, i.e., there has likely been no duplication subsequent to speciation that might make the orthology ambiguous. This measure is quite strict and is a lower bound on the number of orthologs. By these criteria, there are 2758 strict human-fly orthologs, and 2031 human-worm orthologs (1523 in common between these sets).

[Fig. 16 slice labels, given as category (number, percentage).]

cytoskeletal structural protein (20, 1.2%)
chaperone (16, 0.9%)
cell adhesion (11, 0.6%)
miscellaneous (72, 4.2%)
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
nucleic acid enzyme (221, 12.9%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
protooncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
receptor (23, 1.3%)
kinase (69, 4.0%)
select regulatory molecule (88, 5.1%)
transferase (70, 4.1%)
synthase and synthetase (64, 3.7%)
oxidoreductase (64, 3.7%)
molecular function unknown (613, 35.8%)
lyase (12, 0.7%)
ligase (9, 0.5%)
hydrolase (80, 4.7%)
isomerase (21, 1.2%)
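The comparison between the whole human set (Fig. 15) and the conserved set (Fig. 16) amounts to dividing one percentage by the other for each category. The Python sketch below does this for a handful of categories. The percentages are those printed on the slice labels of the two figures as far as they can be read, the selection of categories and the function name are illustrative choices, and the result is only a rough indication because the labels are rounded to one decimal place.

```python
# Percentages read from the Fig. 15 (whole set) and Fig. 16 (conserved set) labels.
WHOLE_SET_PCT = {
    "nucleic acid enzyme": 7.5,
    "transferase": 2.0,
    "hydrolase": 4.0,
    "chaperone": 0.5,
    "intracellular transporter": 1.1,
    "select regulatory molecule": 3.2,
}
CONSERVED_SET_PCT = {
    "nucleic acid enzyme": 12.9,
    "transferase": 4.1,
    "hydrolase": 4.7,
    "chaperone": 0.9,
    "intracellular transporter": 3.0,
    "select regulatory molecule": 5.1,
}


def fold_representation(conserved_pct, whole_pct):
    """Ratio of each category's share in the conserved set to its share overall."""
    return {
        category: conserved_pct[category] / whole_pct[category]
        for category in conserved_pct
        if category in whole_pct
    }


folds = fold_representation(CONSERVED_SET_PCT, WHOLE_SET_PCT)
for category, fold in sorted(folds.items(), key=lambda item: -item[1]):
    print(f"{category:28s} {fold:.2f}x")
```

On these figures the hydrolase ratio stays close to 1, in line with the text's remark that hydrolases are the exception among the enzyme classes.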
7.3 Differences between the human genome and other sequenced eukaryotic genomes

To explore the molecular building blocks of the vertebrate taxon, we have compared the human genome with the other sequenced eukaryotic genomes at three levels: molecular functions, protein families, and protein domains.

Molecular differences can be correlated with phenotypic differences to begin to reveal the developmental and cellular processes that are unique to the vertebrates. Tables 18 and 19 display a comparison among all sequenced eukaryotic genomes, over selected protein/domain families (defined by sequence similarity, e.g., the serine-threonine protein kinases) and superfamilies (defined by shared molecular function, which may include several sequence-related families, e.g., the cytokines). In these tables we have focused on (super) families that are either very large or that differ significantly in humans compared with the other sequenced eukaryote genomes. We have found that the most prominent human expansions are in proteins involved in (i) acquired immune functions; (ii) neural development, structure, and functions; (iii) intercellular and intracellular signaling pathways in development and homeostasis; (iv) hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences between the human genome and the Drosophila or C. elegans genome is the appearance of genes involved in acquired immunity (Tables 18 and 19). This is expected, because the acquired immune response is a defense system that only occurs in vertebrates. We observe 22 class I and 22 class II major histocompatibility complex (MHC) antigen genes and 114 other immunoglobulin genes in the human genome. In addition, there are 59 genes in the cognate [...]. At the domain level, this is exemplified by an expansion and recruitment of the ancient immunoglobulin fold to constitute molecules such as MHC, and of the integrin fold to form several of the cell adhesion molecules that mediate interactions between immune effector cells and the extracellular matrix. Vertebrate-specific proteins include the paracrine immune regulators family of secreted 4-alpha helical bundle proteins, namely the cytokines and chemokines. Some of the cytoplasmic signal transduction components associated with cytokine receptor signal transduction are also features that are poorly represented in the fly and worm. These include protein domains found in the signal transducer and activator of transcription (STATs), the suppressors of cytokine signaling (SOCS), and protein inhibitors of activated STATs (PIAS). In contrast, many of the animal-specific protein domains that play a role in innate immune response, such as the Toll receptors, do not appear to be significantly expanded in the human genome.

Neural development, structure, and function. In the human genome, as compared with the worm and fly genomes, there is a marked increase in the number of members of protein families that are involved in neural development. Examples include neurotrophic factors such as ependymin, nerve growth factor, and signaling molecules such as semaphorins, as well as the number of proteins involved directly in neural structure and function such as myelin proteins, voltage-gated ion channels, and synaptic proteins such as synaptotagmin. These observations correlate well with the known phenotypic differences between the nervous systems of these taxa, notably (i) the increase in the number and connectivity of neurons; (ii) the increase in number of distinct neural cell types (as many as a thousand or more in human compared with a few hundred in fly and worm) (121); (iii) the increased length of individual axons; and (iv) the significant increase in glial cell number, especially the appearance of myelinating glial cells, which are electrically inert supporting cells differentiated from the same stem cells as neurons. A number of prominent protein expansions are involved in the processes of neural development. Of the extracellular domains that mediate cell adhesion, the connexin domain-containing proteins (122) exist only in humans. These proteins, which are not present in the Drosophila or C. elegans genomes, appear to provide the constitutive subunits of intercellular channels and the structural basis for electrical coupling. Pathway finding by axons and neuronal [...]tion is mediated through a subset of ephrins and their cognate receptor tyrosine kinases that act as positional labels to establish topographical projections (123). The probable biological role for the semaphorins (22 in human compared with 6 in the fly and 2 in the worm) and their receptors (neuropilins and plexins) is that of axonal guidance molecules (124). Signaling molecules such as neurotrophic factors and some cytokines have been shown to regulate neuronal cell survival, proliferation, and axon guidance (125). Notch receptors and ligands play important roles in glial cell fate determination and gliogenesis (126).
Other human expanded gene families play key roles directly in neural structure and function. One example is synaptotagmin (expanded more than twofold in humans relative to the invertebrates), originally found to regulate synaptic transmission by serving as a Ca2+ sensor (or receptor) during synaptic vesicle fusion and release (127). Of interest is the increased co-occurrence in humans of PDZ and the SH3 domains in neuronal-specific adaptor molecules; examples include proteins that likely modulate channel activity at synaptic junctions (128). We also noted expansions in several ion-channel families (Table 19), including the EAG subfamily (related to cyclic nucleotide gated channels), the voltage-gated calcium/sodium channel family, the inward-rectifier potassium channel family, and the voltage-gated potassium channel, alpha subunit family. Voltage-gated sodium and potassium channels are involved in the generation of action potentials in neurons. Together with voltage-gated calcium channels, they also play a key role in coupling action potentials to neurotransmitter release, in the development of neurites, and in short-term memory. The recent observation of a calcium-regulated association between sodium channels and synaptotagmin may have consequences for the establishment and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes of protein components in both the central and peripheral nervous system of vertebrates. Myelin P0 is a major component of peripheral myelin, and myelin proteolipid and myelin oligodendrocyte glycoprotein are found in the central nervous system. Mutations in any of these






Table 18. Domain-based comparative analysis of proteins in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The predicted protein set of each of the above eukaryotic organisms was analyzed with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins containing the specified Pfam domains as well as the total number of domains (in parentheses) are shown in each column. Domains were categorized into cellular processes for presentation. Some domains (i.e., SH2) are listed in more than one cellular process. Results of the Pfam analysis may differ from results obtained based on human curation of protein families, owing to the limitations of large-scale automatic classifications. Representative examples of domains with reduced counts owing to the stringent E value cutoff used for this analysis are marked with a double asterisk (**). Examples include short divergent and predominantly alpha-helical domains, and certain classes of cysteine-rich zinc finger proteins.
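The counting convention in the Table 18 legend (number of proteins containing a domain, with the total number of domain occurrences in parentheses) can be sketched in a few lines of Python. The input format, function names, and toy hits below are hypothetical and are not the Pfam output format. The 0.001 default mirrors the E value cutoff stated in the legend. Rendering a cell without parentheses when the two counts coincide is an assumption about how plain entries such as "3" should be read.

```python
from collections import defaultdict


def count_domains(hits, max_evalue=0.001):
    """Tally proteins and domain occurrences per domain.

    hits: iterable of (protein_id, domain_name, evalue), one tuple per
    domain occurrence. Returns {domain: (n_proteins, n_occurrences)}.
    """
    proteins = defaultdict(set)
    occurrences = defaultdict(int)
    for protein_id, domain, evalue in hits:
        if evalue > max_evalue:
            continue  # fails the E value cutoff
        proteins[domain].add(protein_id)
        occurrences[domain] += 1
    return {d: (len(proteins[d]), occurrences[d]) for d in proteins}


def format_cell(n_proteins, n_occurrences):
    """Render a count as 'proteins (occurrences)', or a bare number if equal."""
    if n_proteins == n_occurrences:
        return str(n_proteins)
    return f"{n_proteins} ({n_occurrences})"


# Hypothetical hits: p1 carries two Cadherin repeats; p3's hit is too weak.
toy_hits = [
    ("p1", "Cadherin", 1e-20),
    ("p1", "Cadherin", 1e-15),
    ("p2", "Cadherin", 1e-8),
    ("p3", "Cadherin", 0.5),
    ("p4", "SH2", 1e-6),
]

table = count_domains(toy_hits)
for domain, (n_prot, n_occ) in sorted(table.items()):
    print(domain, format_cell(n_prot, n_occ))
```

A stricter `max_evalue` lowers both counts, which is the effect the legend flags for short or divergent domains marked with a double asterisk.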



Accession number  Domain name       Domain description                                H

Developmental and homeostatic regulators
PF02039           Adrenomedullin    Adrenomedullin
PF00212           ANP               Atrial natriuretic peptide
PF00028           Cadherin          Cadherin domain
PF00214           Calc_CGRP_IAPP    Calcitonin/CGRP/IAPP family
PF01110           CNTF              Ciliary neurotrophic factor
PF01093           Clusterin         Clusterin
PF00029           Connexin          Connexin
PF00976           ACTH_domain       Corticotropin ACTH domain
PF00473           CRF               Corticotropin-releasing factor family
PF00007           Cys_knot          Cystine-knot domain
PF00778           DIX               Dix domain
PF00322           Endothelin        Endothelin family
PF00812           Ephrin            Ephrin
PF01404           EPh_lbd           Ephrin receptor ligand binding domain
PF00167           FGF               Fibroblast growth factor
PF01534           Frizzled          Frizzled/Smoothened family membrane region
PF00236           Hormone6          Glycoprotein hormones
PF01153           Glypican          Glypican
PF01271           Granin            Granin (chromogranin or secretogranin)
PF02058           Guanylin          Guanylin precursor
PF00049           Insulin           Insulin/IGF/Relaxin family
PF00219           IGFBP             Insulin-like growth factor binding proteins
PF02024           Leptin            Leptin
PF00193           Xlink             LINK (hyaluron binding)
PF00243           NGF               Nerve growth factor family
PF02158           Neuregulin        Neuregulin family
PF00184           Hormone5          Neurohypophysial hormones
PF02070           NMU               Neuromedin U
PF00066           Notch             Notch (DSL) domain
PF00865           Osteopontin       Osteopontin
PF00159           Hormone3          Pancreatic hormone peptides
PF01279           Parathyroid       Parathyroid hormone family
PF00123           Hormone2          Peptide hormone
PF00341           PDGF              Platelet-derived growth factor (PDGF)
PF01403           Sema              Sema domain
PF01033           Somatomedin_B     Somatomedin B domain
PF00103           Hormone           Somatotropin
PF02208           Sorb              Sorbin homologous domain
PF02404           SCF               Stem cell factor
PF01034           Syndecan          Syndecan domain
PF00020           TNFR_c6           TNFR/NGFR cysteine-rich region
PF00019           TGF-β             Transforming growth factor β-like domain
PF01099           Uteroglobin       Uteroglobin family
PF01160           Opiods_neuropep   Vertebrate endogenous opioids neuropeptide
PF00110           Wnt               Wnt family of developmental signaling proteins

Hemostasis
PF01821           ANATO             Anaphylotoxin-like domain
PF00386           C1q               C1q domain
PF00200           Disintegrin       Disintegrin
PF00754           F5_F8_type_C      F5/8 type C domain
PF01410           COLFI             Fibrillar collagen C-terminal domain
PF00039           Fn1               Fibronectin type I domain
PF00040           Fn2               Fibronectin type II domain
PF00051           Kringle           Kringle domain
PF01823           MACPF             MAC/Perforin domain
PF00354           Pentaxin          Pentaxin family
PF00277           SAA_proteins      Serum amyloid A protein
PF00084           Sushi             Sushi domain (SCR repeat)
PF02210           TSPN              Thrombospondin N-terminal-like domains
PF01108           Tissue_fac        Tissue factor
PF00868           Transglutamin_N   Transglutaminase family
PF00927           Transglutamin_C   Transglutaminase family

1 

2 

100(550) 
3 
1 
3 

.'. 14(16) 
1 
2 

10(11) 
5 

7(8) 
12 
23 
9 
1 

14 
3 
1 
7 

10 
1 

13(23) 
. 3 
4 
1 
1 

3(5) 
1 
3 

5(9) 
S 
5(8)
1 
2 
2 

17(31) 
27(28) 

3 

3 
18 

6(14) 
24 
18 
15(20) 

.10 . 
5(18) 
11(16) 
9
4
14
1 
6 
Accession number  Domain name       Domain description                                W

PF00594           Gla               Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain

Immune response
PF00711           Defensin_beta     Beta defensin
PF00748           Calpain_inhib     Calpain inhibitor repeat
PF00666           Cathelicidins     Cathelicidins
PF00129           MHC_I             Class I histocompatibility antigen, domains alpha 1 and 2
PF00993           MHC_II_alpha**    Class II histocompatibility antigen, alpha domain
PF00969           MHC_II_beta**     Class II histocompatibility antigen, beta domain
PF00879           Defensin_propep   Defensin propeptide
PF01109           GM_CSF            Granulocyte-macrophage colony-stimulating factor
PF00047           ig                Immunoglobulin domain
PF00143           Interferon        Interferon alpha/beta domain
PF00714           IFN-gamma         Interferon gamma
PF00726           IL10              Interleukin-10
PF02372           IL15              Interleukin-15
PF00715           IL2               Interleukin-2
PF00727           IL4               Interleukin-4
PF02025           IL5               Interleukin-5
PF01415           IL7               Interleukin-7/9 family
PF00340           IL1               Interleukin-1
PF02394           IL1_propep        Interleukin-1 propeptide
PF02059           IL3               Interleukin-3
PF00489           IL6               Interleukin-6/G-CSF/MGF family
PF01291           LIF_OSM           Leukemia inhibitory factor (LIF)/oncostatin (OSM) family
PF00323           Defensins         Mammalian defensin
PF01091           PTN_MK            PTN/MK heparin-binding protein
PF00277           SAA_proteins      Serum amyloid A protein
PF00048           IL8               Small cytokines (intecrine/chemokine), interleukin-8 like
PF01582           TIR               TIR domain
PF00229           TNF               TNF (tumor necrosis factor) family
PF00088           Trefoil           Trefoil (P-type) domain

PF00779           BTK               BTK motif
PF00168           C2                C2 domain
PF00609           DAGKa             Diacylglycerol kinase accessory domain (presumed)
PF00781           DAGKc             Diacylglycerol kinase catalytic domain (presumed)
PF00610           DEP               Domain found in Dishevelled, Egl-10 and Pleckstrin (DEP)
PF01363           FYVE              FYVE zinc finger
PF00996           GDI               GDP dissociation inhibitor
PF00503           G-alpha           G-protein alpha subunit
PF00631           G-gamma           G-protein gamma like domains
PF00616           RasGAP            GTPase-activator protein for Ras-like GTPase
PF00618           RasGEFN           Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif
PF00625           Guanylate_kin     Guanylate kinase
PF02189           ITAM              Immunoreceptor tyrosine-based activation motif
PF00169           PH                PH domain
PF00130           DAG_PE-bind       Phorbol esters/diacylglycerol binding domain (C1 domain)
PF00388           PI-PLC-X          Phosphatidylinositol-specific phospholipase C, X domain
PF00387           PI-PLC-Y          Phosphatidylinositol-specific phospholipase C, Y domain
PF00640           PID               Phosphotyrosine interaction domain (PTB/PID)
PF02192           PI3K_p85B         PI3-kinase family, p85-binding domain
PF00794           PI3K_rbd          PI3-kinase family, ras-binding domain
PF01412           ArfGAP            Putative GTP-ase activating protein for Arf
PF02196           RBD               Raf-like Ras-binding domain
PF02145           Rap_GAP           Rap/ran-GAP
PF00788           RA                Ras association (RalGDS/AF-6) domain
PF00071           Ras               Ras family
PF00617           RasGEF            RasGEF domain
PF00615           RGS               Regulator of G protein signaling domain
PF02197           RIIa              Regulatory subunit of type II PKA R-subunit
5(6)
7
3
1

381 (930)
2
32
5(6)



0 
0
0

0

0 
0 
0 
0 

125(291) 
0 
0 
0 
0 
0 
0
0 
0 
0 

0 
0 
0 
0 

8 
0 
0
0

0
0

0
0
0
0

67(323) 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 

2 
0 
2 



0 
0 
0 

: 0 

0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

0 

6 

0 



24(27) 
2 
6 
16 
6(7) 
5 

18(19) 
126 
21 
27 
4 



13 
1 
3 
9 
4 
4 

7(9) 
56(57) 
8 

6(7) 
1 



11(12) 
1 
1 
8 
1 
2 
6 
51 
7 

12(13) 
2 



0 
0 
0 
6 
0 
0 

1 

23 
5 
1 
1 



0 
0 
0 

■J . 0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



131 (143) 
0 
0 



5 

73(101) 
9 
10 
12(13) 


1 

32 (44) 
4 
8 
4 


0 

24(35) 
7 
8 
10 


0 

6(9) 
0 
2 
5 


0 

66 (90) 
6 

11(12) 
2 


_ 28(30)-:, 
6 

27(30) 
16 
11 
9 


- - 14 - 
2 
10 
5 
5 
2 


-Jl-5 
1 

^0(23) 
5 
8 
3 


- 5 
1 
2 
1 
3 
5 


- 15 
3 
5 
0 
0 
0 


12 
3 

193(212) 
45(56) 


8 
0 

72(78) 
25(31) 


7 
0 

65(68) 
26(40) 


1 
0 

. 24 
1(2) 


4 
0 
23 
4 


12 


3 


7 


1 


8 


11 


2 


7 


1 


8 



0 
0 
0 
15 
0 
0 
0 
78 
0 
0 
0 
Table 18 (Continued)



Accession 
number . 



Domain name-' 



. Domain description 



H 



W 



.'PF00620 
PF00621 
PF00536 
PF01369 

, . PF00017 
PF00018 
PF01017 
PF00790 

. PF00568 

PF00452 
PF02180 
PF00619 
PF00531 
PF01335 
PF02179 
PF00656 
PF00653 

PF00022 
PF00191 
PF004O2 
PF00373 
PF00880 
. PF00681 
PF00435 
PF00418 
PF00992 
PF02209 
PF01044 

PF01391 
PF01413 

PF00431 
PF00008 
PF00147 

PF00041 

PF00757 

PF00357 

PF00362 

PFO0O52 

PF00053 

PF00054 

PF000S5 

PF00059 

PF01463 

PF01462 

PF00057 . 

PF00058 

PF00530 

PF00084 

PF00090 

PF00092 

PF00093 

PF00094 

PF00244 . 

PF00023 

PF00514 

PF00168 

PF00027 

P"F01556 

PF00226 

PF00036 

PF00611 

PF01846 

PF00498 



RhoGAP 
RhoGEF 
SAM 
Sec7 
SH2 
SH3 
: STAT 
VHS 
WH1 

Bd-2 
- BH4 
CARD 
Death 
DED 
BAG 
ICE_p20 
BIR 

Actin 

Annexin 

Catponin 

Band_41 

Nebulin_repeat 

Plectin_repeat 

Spectrin 

Tubulin-binding 

Troponin 

VHP ' 

Vinculin 

Collagen 
C4 

CUB 
EGF 

Fibrinogen^ 
Fn3 

Furin-tike 
Integrin^A 
..IntegrinJJ 
Lamininji 
Laminin_EGF 
Laminin_C 
LamininJMterm 
Lectin_c 
LRRCT 
LRRNT 
LdLrecept.a 
LdLrecept b 
SRCR 
Sushi 
Tsp_1 
Vwa 
Vwc 
Vwd 

14-3-3 
Ank 

Armadillo_seg 
C2 

cNMP binding 

DnaJ_C 

Dnaj 

Efhand** 

FCH 

FF 

FHA 



RhoGAP domain 
; RhoGEF domain 

SAM domain (Sterile alpha motif) 
Sec7 domain 
: Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain . . 
WH1 domain 



59 
■•. 46 
- 29(31) 
13 

87(95) 
M 43 (182) 
7 
4 
7 



Domains involved in apoptosis 

Bc( : 2 . 
. . Bcl : 2 homology rejgion 4 
. Caspase recruitment domain 
Death domain 
Death effector domain ' 
Domain present in Hsp70 regulators 
ICE-like protease (caspase) p20 domain 
Inhibitor of Apoptosis domain . 

. Cytoskeletal 

Actin , 

Annexin . , 

Calponin family 

FERM domain (Band 4.1 family) 
• Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

ViUin headpiece domain 
Vinculin family 

. ' ECM adhesion 
Collagen triple helix repeat (20 copies) 
C-terminal tandem repeated domain In type 4 

procollagen 
CUB domain 
EGF-like domain 

Fibrinogen beta and gamma chains, C-terminal 

globular domain 
Fibronectin type III domain 
Furin-like cysteine rich region 
Integrin alpha cytoplasmic region 
. Integrins. beta chain 
Laminin B (Domain IV) 
Laminin EGF-like (Domains III and V) 
Laminin G domain 
Laminin N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-terminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Vyillebrand factor type D. domain 

Protein interaction domains 

14-3-3 proteins 
Ank repeat 

Armadilto/beta-catenin-like repeats 
C2 domain • 

Cyclic nucleotide-binding domain 
Dnaj C terminal region 
Dnaj domain 
EF hand 

Fes/CIP4 homology domain 
FF domain 
FHA domain 



9 
' 3 
16 
_ 16 
4(5) 
5(8) 
11 
8(14) 



19 

Y -23(24) 
15 
5 

33(39) 
55(75) 
1 
2 

. 2 

- . 2 
0 
0 

. '5 . 

0 
3 

5(9) 



20 

.18(19) 
8 
5 

44(48) 
46(61) 
1(2) 
4 

2(3) 

1 
1 
2 
7 
0 
2 

2(3) 



9 

, 3 
.3 
5 

23(27) 
0 
4 
1 

•■, 0 

. o 

0 

o 

0 

1 

0 

1(2) 



20 

145 (404) : 
22(56) 
73(101) 
26(31) 
12 
44 

83(151) 
9 

4(11) 
13 



8 
0 
6 
9 
3 
4 
0 
8 
0 

0 
0 
0 
0 
0 
5 
0 
0 



61 (64) 


15(16) 


12 


9(11) 


24 


16(55) 


4(16) 


4(11) 


0 


6(16) 


13 (22) 


3 


7(19) 


0 


0 


29 (30) 


17(19) 


11 (14) 


"0 


0 


4(148) 


1(2) 


1 


0 


0 


2(11) 


0 


0 


0 


0 




13(171) 


10 (93) 


0 


0 


4(12) 


1(4) 


2(8) 


0 


0 


4 


6 


8 


0 


0 


5 


2 


2 


0 


5 


4 


2 


1 


b 


0 


65(279) 


: = .10(46) / 


174(384) 


" : 0 


0 


6(11) 


2(4) 


3(6) 


0 


0 


47(69) 


9(47) 


43 (67) 


0 


0 


108 (420) 


45 (186) 


54(157) 


0 


1 


26 


10(11) 


6 


0 


0 


106 (545) 


42 (168) 


34(156) 


. - 0 - 


1 


5 


2 


1 


0 


0 


3 


1 


2 


0 


0 


8 


2 


2 


0 


0 


8(12) 


4(7) 


6(10) 


0 


0 


24(126) 


9(62) 


11(65) 


0 


0 


30(57) 


18(42) 


14(26) 


0 


0 


10 


6 


4 


0 


0 


47(76) 


23(24) 


91 (132) 


0 


0 


69(81) 


23(30) 


7(9) 


0 


0 


40(44) 


7(13) 


3(6) 


0 


0 


35(127) 


33(152) 


27(113) 


0 


0 


15(96) 


9(56) 


7(22) 


0 


0 


11(46) 


4(8) 


1(2) 


0 


0 


53 (191) 


11 (42) 


8(45) 


0 


0 


41 (66) 


11(23) 


18(47) 


0 


0 


34(58) 


0 


17(19) 


0 


1 


19(28) 


6(11) 


2(5) 


0 


0 


15(35) 


3(7) 


... .. .9 


0 . 


0 



:.3"'.' 


.*■.■" 3 


"2 "' 


15 


72 (269) 


75(223) 


12(20) 


66 (111) 


11(38) 


3(11) 


2(10) 


25(67) 


32 (44) 


24(35) 


6(9) 


66(90) 


21(33) 


15(20) 


2(3) 


22 


9 


5 


3 


19 


34 


33 


20 


93 


64(117) 


41 (86) 


4(11) 


120(328) 


3 




4 


0 


4(10) 


3(16) 


2(5) 


4(8) 


. 15 


7 


13(14) 


17 






16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org



myelin proteins result in severe demyelination, which is a pathological
condition in which the myelin is lost and the nerve conduction is
severely impaired (130). Humans have at least 10 genes belonging to four
different families involved in myelin production (five myelin P0, three
myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte
glycoprotein, or MOG), and possibly more remotely related members of the
MOG family. Flies have only a single myelin proteolipid, and worms have
none at all.

Table 18 (Continued)



Intercellular and intracellular signaling pathways in development and
homeostasis. Many protein families that have expanded in humans relative
to the invertebrates are involved in signaling processes, particularly in
response to development and differentiation



Accession number

Domain name

Domain description



PF00254 
PF01590 
PF01344 
PF00560 
PF00917 
PF00989 
PF00595 
PF00169 
PF01535 
PF00536 
PF01369 
PF00017 
PF00018 
PF01740 
PF00515 
PF00400 
PF00397 
PF00569 

PF01754 
PF01388 
PF01426 
PF00643 
PF00533 
PF00439 
PF00651 
PF00145 
PF00385 



PF00125 

PF00134 

PF00270 

PF01529 

PF00646 

PF00250 

PF00320

PF01585 

PF00010 

PF00850 

PF00046 

PF01833 

PF02373 

PF02375

PF00013 

PF01352 

PF00104 



FKBP
GAF
Kelch
LRR**
MATH
PAS
PDZ
PH
PPR**
SAM
Sec7
SH2
SH3
STAS
TPR**
WD40**
WW
ZZ

Zf-A20
ARID
BAH
Zf-B_box**
BRCT
Bromodomain
BTB
DNA_methylase
Chromo

Histone
Cyclin
DEAD
Zf-DHHC
F-box**
Fork_head
GATA
G-patch
HLH**
Hist_deacetyl
Homeobox
TIG
JmjC
JmjN
KH-domain
KRAB
Hormone_rec



PF00412 
PF00917 
PF00249 
PF02344 
PF01753
PF00628 
PF00157 
PF02257
PF00076 

PF02037 
PF00622 
PF01852 
PF00907



LIM
MATH
Myb_DNA-binding
Myc-LZ
Zf-MYND
PHD
Pou
RFX_DNA_binding
Rrm
SAP
SPRY
START
T-box



FKBP-type peptidyl-prolyl cis-trans isomerases
GAF domain
Kelch motif
Leucine Rich Repeat
MATH domain
PAS domain
PDZ domain (Also known as DHR or GLGF)
PH domain
PPR repeat
SAM domain (Sterile alpha motif)
Sec7 domain
Src homology 2 (SH2) domain
Src homology 3 (SH3) domain
STAS domain
TPR domain
WD40 domain
WW domain
ZZ-Zinc finger present in dystrophin, CBP/p300
A20-like zinc finger

Nuclear interaction domains

ARID DNA binding domain
BAH domain
B-box zinc finger
BRCA1 C Terminus (BRCT) domain
Bromodomain
BTB/POZ domain
C-5 cytosine-specific DNA methylase
'chromo' (CHRromatin Organization MOdifier)
Core histone H2A/H2B/H3/H4
Cyclin
DEAD/DEAH box helicase
DHHC zinc finger domain
F-box domain
Fork head domain
GATA zinc finger
G-patch domain



15(20) 
7(8) 
54(157) 
25(30) 
11 

18(19) 
96(154) 
193(212) 
5 

: 29(31) 
13 

87(95) 
143(182) 

72(131) 
136(305) 
32(53) 
10(11) 

2(8) 
11 
8(10) 
32(35) 
17(28)- 
37(48) . 
97(98) 
3(4) 
24(27) 



7(8) 
2(4) 
12(48) 
24(30) 
5 

9(10) 
60(87) 
72(78) 

3H) 
15 
5 

- 33(39) 
55(75) 
1 

39(101) 
\ 98(226) 
24(39) 
13 

2 
6 

7(8) 
1 

10(18) 
16(22): 
62(64) 

; 14(15) 



703) 4 

1 0 

13(41) 3 

7(11) ; 1 

88(161) t 

6 ! 

46(66) 2 

65(68) 24 

I... 0 1 

8 3 

5 5 
'44(48) i 

46(61) . 23(27) 

6 2 
28(54) 16(31) 

72(153) -.56(121) 

16(24) 5(8) 

10 2 



Helix-loop-helix DNA-binding domain
Histone deacetylase family
Homeobox domain
IPT/TIG domain
JmjC domain
JmjN domain
KH domain
KRAB box
Ligand-binding domain of nuclear hormone receptor
LIM domain containing proteins
MATH domain
Myb-like DNA-binding domain
Myc leucine zipper domain
MYND finger
PHD-finger
Pou domain - N-terminal to homeobox domain
RFX DNA-binding domain
RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
SAP domain
SPRY domain
START domain
T-box



75(81) 
•19 
63(66) 
15 
16 
35(36) . 
11(17) 
18 
60(61) 
12 

160(178) 
29(53) 
10 
7 

28(67) 
204(243) 
47 

62(129) 
11 

32(43) 
1 
14 

68(86) 
15 
7 

224(324) 
15 

44(51) 
10 
17(19) 



5 
10 
48(50) 
20 
' 15 
20(21) 
5(6) 
16 
44 
5(6) 
100(103) 
11(13) 
4 
4 

14(32) 
0 
17 

33 (83) 
5 

18(24) 
0 
14 

40(53) 
5 
2 

127(199) 
8 

10(12) 
2 
8 



2 
4 

4(5) 
2 

23(35) 
18(26) 
86(91) 
0 

17(18) 

71(73) 
10 

... 55(57) 
16 

309(324) 
15.. 
8(10) 
13 
24 
8(10) 
82(84) 
5(7) 
6 

17(46) 
0 

142(147) 

33(79) 
88(161) 
17(24) 

0 

9 

32(44) 
4 
1 

94(145) 
5 

5(7) 
6 
22 



0 
2 
5 
0 

10(16) 
10(15) 
1(2) 

... 0 
1(2) 

8 



24(29) 
10 

102(178) 
15(16) 
61 (74) 
13(18) 
5 
23 

474(2485) 
6 
9 
3 
4 
13 

65 (124) 
167(344) 
11(15) 
10 

8 
7 

21 (25) 
0 

12(16) 
28 
30(31) 
V. 13(15) 
12 

48 



11 


35 


50(52) 


84(87) 


7 


22 


9 


165(167) 


; .4 . 


- - a 


9 


26 


4 


14(15) 


4 


39 


5 


10 


6 


66 


2 


1 


4 


7 


3 




4(14) 


27(61) 


0 


0 


0 


0 


4(7) 


10(16) 


1 


61 (74) 


15(20) 


243 (401) 


0 


0 


1 


7 


14(15) 


96(105) 


0 


0 




0 


43(73) 


232 (369) 


5 


6(7) 


3 


6 


0 


23 


0 


0 
Table 18 (Continued)

Accession number

Domain name

Domain description

PF02135
PF01285


Zf-TAZ 
TEA 


TAZ finger 
TEA domain 


2(3) 
4 


!(2) 

.1 

\: (3) 
,4(8) 


6(7) 


0 


10(15) 


PF02176 


Zf-TRAF 


TRAF-type zinc finger


: :' ;6(9) ; 
; : 2 (4) . 


1. 
1 

^v2(4) ; 


• 1 • 


o 


PF00352 



TBP 


Transcription factor TFIID (or TATA-binding protein, TBP)


..o • 
::;i(2)- 


2(4) 


PF00567
PF00642
PF00096
PF00097
PF00098


TUDOR
Zf-CCCH
Zf-C2H2**
Zf-C3HC4
Zf-CCHC


TUDOR domain
Zinc finger C-x8-C-x5-C-x3-H type (and similar)
Zinc finger, C2H2 type
Zinc finger, C3HC4 type (RING finger)
Zinc knuckle


9(24) 
17(22) 
> 564 (4500) 
.135(137) 

9(17) 


•9(19)- 
-6(8), 
234(771) . 
57 

. 6(10) 


: ; 4(5). 
' ,22 (42) 
68(155) 
-88(89) 
17(33) 


o 

• r f 3(s)V 

'-34(56) 
18 

,7(13) 


. 2 

: 31(46) 
V21 (24) 
298 (304) 
: 68(91) 



(Tables 18 and 19). They include secreted hormones and growth factors,
receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome
include growth factors such as wnt, transforming growth factor-β (TGF-β),
fibroblast growth factor (FGF), nerve growth factor, platelet-derived
growth factor (PDGF), and ephrins. These growth factors affect tissue
differentiation and a wide range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The corresponding receptors of
these developmental ligands are also expanded in humans. For example, our
analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the
worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt
signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the
worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho
family of transcriptional corepressors downstream in the wnt pathway is
even more markedly expanded, with 13 predicted members in humans (2 in the
fly, 1 in the worm).
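The fold expansions implied by these counts can be tabulated directly. The following is a minimal sketch that uses only the gene counts quoted in the paragraph above; the dictionary layout and the helper name `fold_expansion` are illustrative assumptions and are not part of the original analysis. Note that the human ephrin count is stated as a lower bound ("at least 8").

```python
# Gene counts quoted in the text, as (human, fly, worm).
FAMILY_COUNTS = {
    "ephrin ligands": (8, 2, 4),        # human value is "at least 8"
    "ephrin receptors": (12, 2, 1),
    "wnt family": (18, 6, 5),
    "frizzled receptors": (12, 6, 5),
    "Groucho corepressors": (13, 2, 1),
}


def fold_expansion(human, other):
    """Return the human/other count ratio, or None if the other count is zero."""
    return None if other == 0 else human / other


for family, (human, fly, worm) in FAMILY_COUNTS.items():
    vs_fly = fold_expansion(human, fly)
    vs_worm = fold_expansion(human, worm)
    print(f"{family:22s} human={human:3d}  vs fly x{vs_fly:.1f}  vs worm x{vs_worm:.1f}")
```

Ratios of this kind compare raw gene counts only; they say nothing about domain architecture or expression, which the text treats separately.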

Extracellular adhesion molecules involved in signaling are expanded in the
human genome (Tables 18 and 19). The interactions of several of these
adhesion domains with extracellular matrix proteoglycans play a critical
role in host defense, morphogenesis, and tissue repair (131). Consistent
with the well-defined role of heparan sulfate proteoglycans in modulating
these interactions (132), we observe an expansion of the heparin sulfate
sulfotransferases in the human genome relative to worm and fly. These
sulfotransferases modulate tissue differentiation (133). A similar
expansion in humans is noted in structural proteins that constitute the
actin-cytoskeletal architecture. Compared with the fly and worm, we
observe an explosive expansion of the nebulin (35 domains per protein on
average), aggrecan (12 domains per protein on average), and plectin (5
domains per protein on average) repeats in humans. These repeats are
present in proteins involved in modulating the actin-cytoskeleton with
predominant expression in neuronal, muscle, and vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed several
expanded protein families and domains involved in cytoplasmic signal
transduction (Table 18). In particular, signal transduction pathways
playing roles in developmental regulation and acquired immunity were
substantially enriched. There is a factor of 2 or greater expansion in
humans in the Ras superfamily GTPases and the GTPase activator and GTP
exchange factors associated with them. Although there are about the same
number of tyrosine kinases in the human and C. elegans genomes, in humans
there is an increase in the SH2, PTB, and ITAM domains involved in
phosphotyrosine signal transduction. Further, there is a twofold expansion
of phosphodiesterases in the human genome compared with either the worm or
fly genomes.
The downstream effectors of the intracellular signaling molecules include
the transcription factors that transduce developmental fates. Significant
expansions are noted in the ligand-binding nuclear hormone receptor class
of transcription factors compared with the fly genome, although not to the
extent observed in the worm (Tables 18 and 19). Perhaps the most striking
expansion in humans is in the C2H2 zinc finger transcription factors. Pfam
detects a total of 4500 C2H2 zinc finger domains in 564 human proteins,
compared with 771 in 234 fly proteins. This means that there has been a
dramatic expansion not only in the number of C2H2 transcription factors,
but also in the number of these DNA-binding motifs per transcription
factor (8 on average in humans, 3.3 on average in the fly, and 2.3 on
average in the worm). Furthermore, many of these transcription factors
contain either the KRAB or SCAN domains, which are not found in the fly or
worm genomes. These domains are involved in the oligomerization of
transcription factors and increase the combinatorial partnering of these
factors. In general, most of the transcription factor domains are shared
between the three animal genomes, but the reassortment of these domains
results in organism-specific transcription factor families. The domain
combinations found in the human, fly, and worm include the BTB with C2H2
in the fly and humans, and homeodomains alone or in combination with Pou
and LIM domains in all of the animal genomes. In plants, however, a
different set of transcription factors are expanded, namely, the myb
family, and a unique set that includes VP1 and AP2 domain-containing
proteins (134). The yeast genome has a paucity of transcription factors
compared with the multicellular eukaryotes, and its repertoire is limited
to the expansion of the yeast-specific C6 transcription factor family
involved in metabolic regulation.
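The per-protein averages quoted for C2H2 zinc fingers follow from dividing the domain count by the protein count. A minimal sketch, assuming only the human and fly figures stated in the paragraph above (the worm counts are not given in the prose, so they are left out); the function name `domains_per_protein` is illustrative:

```python
# C2H2 zinc finger counts quoted in the text, as (proteins, domains).
C2H2_COUNTS = {
    "human": (564, 4500),
    "fly": (234, 771),
}


def domains_per_protein(proteins, domains):
    """Average number of domains carried by each domain-containing protein."""
    return domains / proteins


for organism, (proteins, domains) in C2H2_COUNTS.items():
    average = domains_per_protein(proteins, domains)
    print(f"{organism}: {average:.1f} C2H2 domains per protein")
```

Rounded to one decimal place, these ratios reproduce the averages of 8 (human) and 3.3 (fly) given in the text.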
While we have illustrated expansions in a subset of signal transduction
molecules in the human genome compared with the other eukaryotic genomes,
it should be noted that most of the protein domains are highly conserved.
An interesting observation is that worms and humans have approximately the
same number of both tyrosine kinases and serine/threonine kinases (Table
19). It is important to note, however, that these are merely counts of the
catalytic domain; the proteins that contain these domains also display a
wide repertoire of interaction domains with significant combinatorial
diversity.
Hemostasis. Hemostasis is regulated primarily by plasma proteases of the
coagulation pathway and by the interactions that occur between the
vascular endothelium and platelets. Consistent with known anatomical and
physiological differences between vertebrates and invertebrates,
extracellular adhesion domains that constitute proteins integral to
hemostasis are expanded in the human relative to the fly and worm (Tables
18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and
C1q that mediate surface interactions between hematopoietic cells and the
vascular matrix. In addition, there has been extensive recruitment of
more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and
FN3 into multidomain proteins that are involved in hemostatic regulation.
Although we do not find a large expansion in the total number of serine
proteases, this enzymatic domain has been specifically recruited into
several of these multidomain proteins for proteolytic regulation in the
vascular compartment. These are represented in plasma proteins that belong
to the kinin and complement pathways. There is a [...] metalloproteases:
ADAM (a disintegrin and metalloprotease) and MMPs (matrix
metalloproteases) (Table 19). Proteolysis [...] development and for tissue
degradation in diseases such as cancer, arthritis, Alzheimer's disease,
and [...] inflammatory conditions (135, 136). ADAMs are a family of
integral [...] fibrinogenolysis and modulating interactions between
hematopoietic components and the vascular matrix components. These
proteins have been shown to cleave matrix proteins and even signaling
molecules: ADAM-17 [...] tumor necrosis factor-α, and ADAM-10 has been
implicated in the Notch signaling pathway (135). We have identified 19
members of the matrix metalloprotease family, and a total of 51 members of
the ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway
components across eukarya is consistent with its central role in
developmental regulation and as a response to pathogens and stress
signals. The signal transduction pathways involved in programmed cell
death, or apoptosis, are mediated by interactions between
well-characterized domains that include extracellular domains, adaptor
(protein-protein interaction) domains, and those found in effector and
regulatory enzymes (137). We enumerated the protein counts of central
adaptor and effector enzyme domains that are found only in the apoptotic
pathways to provide an estimate of divergence across eukarya and relative
[...] compared with the fly and worm (Table 18). The adaptor domains found
in proteins restricted only to apoptotic regulation such as the DED
domains are vertebrate-specific, whereas others [...] of Bcl2 family
members in humans is significantly expanded. Although plants and yeast
lack the caspases, caspase-like molecules, namely the para- and
meta-caspases, have been reported in these organisms (138). Compared with
other animal genomes, the human genome shows an expansion in the adaptor
and effector domain-containing proteins involved in apoptosis, as well as
in the proteases involved in the cascade such as the caspase and calpain
families.

Expansions of other protein families. Metabolic enzymes. There are fewer
cytochrome P450 genes in humans than in [...]. Lipoxygenases (six in
humans), on the other hand, appear to be specific to the vertebrates and
plants, whereas the lipoxygenase [...] may be vertebrate-specific.
Lipoxygenases are involved in arachidonic acid metabolism, and they and
their activators have been implicated in human pathology ranging from
allergic responses to cancers. One of the most surprising human
expansions, however, is in the number of glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) genes (46 in humans, 3 in the fly, and 4 in the
worm). There is, however, evidence for many retrotransposed GAPDH
pseudogenes (139), which may account for this apparent expansion. However,
it is interesting that GAPDH, long [...] involved in basic metabolism
found across all phyla from bacteria to humans, has recently been shown to
have other functions. It has a second [...]

[...] human expansions has occurred in [...] families involved in the
translational machinery. [...] identified 28 different ribosomal [...]
that each have at least 10 copies in the genome; on average, for all
ribosomal proteins [...] times the number of ribonucleoprotein genes [...]
about the same [...] genome. Whether the diversity of ribonucleoprotein
genes in humans contributes to gene regulation at either the splicing or
translational level is unknown.

Table 19 (Continued)

Posttranslational modifications. In this set of processes, the most
prominent expansion is the transglutaminases, calcium-dependent enzymes
that catalyze the cross-linking of proteins in cellular processes such as
hemostasis and apoptosis (147). The vitamin K-dependent gamma carboxylase
gene product acts on the GLA domain (missing in the fly and worm) found in
coagulation factors, osteocalcin, and matrix GLA protein (148).
Tyrosylprotein sulfotransferases participate in the posttranslational
modification of proteins involved in inflammation and hemostasis,
including coagulation factors and chemokine receptors (149). Although
there is no significant numerical increase in the counts for domains
involved in nuclear protein modification, there are a number of domain
arrangements in the predicted human proteins that are not found in the
other currently sequenced genomes. These include the tandem association of
two histone deacetylase domains in HD6 with a ubiquitin finger domain, a
feature lacking in the fly genome. An additional example is the
co-occurrence of the important nuclear regulatory enzyme PARP (poly-ADP
ribosyl transferase) domain fused to the protein-interaction domains BRCT
and VWA in humans.

Concluding remarks. There are several possible explanations for the
differences in phenotypic complexity observed in humans when compared to
the fly and worm. Some of these relate to the prominent differences in the
immune system, hemostasis, neuronal, vascular, and cytoskeletal
complexity. The finding that the human genome contains fewer genes than
previously predicted might be compensated for by combinatorial diversity
generated at the levels of protein architecture, transcriptional and
translational control, posttranslational modification of proteins, or
posttranscriptional regulation. Extensive domain shuffling to increase or
alter combinatorial diversity can provide an exponential increase in the
ability to mediate protein-protein interactions without dramatically
increasing the absolute size of the protein complement (150). Evolution of
apparently new (from the perspective of [...]) protein domains and
increasing regulatory complexity by domain accretion both [...] novel
domains with preexisting ones) are two features that we observe in humans.
Perhaps the best illustration of this trend is the C2H2 zinc
finger-containing transcription factors, where we see expansion in the
number of domains per protein, together with vertebrate-specific domains
such as KRAB and SCAN. Recent reports on the prominent use of internal
ribosomal entry sites in the human [...] of proteins suggest that this is
an area [...] process in the human genome (151). At the posttranslational
level, although [...] examples of expansions of some protein families
involved in these modifications, further experimental evidence is required
to evaluate whether this is correlated [...] posttranscriptional
processing and the extent of isoform generation in the human [...]
machinery, further analysis will be required to dissect regulation at this
level.

8 Conclusions 
8.1 The whole-genome sequencing
approach versus BAC by BAC

[...] shotgun sequencing approach to a diverse group of organisms with a
wide range of genome sizes and repeat content allows us to assess its
strengths and [...] success of the method for a large number of [...] no
doubt concerning the [...] this method [...] of microbial genomes that
have been sequenced [...] genomes can be sequenced [...] efficiency
without any input other than the de novo mate-paired sequences. With more
complex genomes like those of Drosophila [...] in the form of well-ordered
markers, has been critical for long[...] scaffolds into chromosomes, the
quality of the map (in terms of the order of [...] could have been
performed concurrently with sequencing, the prior existence of mapping
data was beneficial. During the sequencing of the A. thaliana genome,
sequencing of individual
BAC clones permitted extension of the sequence well into centromeric
regions and allowed high-quality resolution of complex repeat regions.
Likewise, in Drosophila, the BAC physical map was most useful in regions
near the highly repetitive centromeres and telomeres. WGA has been found
to deliver excellent-quality reconstructions of the unique regions of the
genome. As the genome size, and more importantly the repetitive content,
increases, the WGA approach delivers less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone approaches make them
difficult to justify as a [...] genome-sequencing projects. Specific
applications of BAC-based or other clone-based sequencing strategies to
resolve ambiguities in sequence assembly that cannot be efficiently
resolved with computational approaches alone are clearly worth exploring.
Hybrid approaches to whole-genome sequencing will only work if there is
sufficient coverage in both the whole-genome shotgun phase and the BAC
clone sequencing phase. Our experience with human genome assembly suggests
that this will require at least 3X coverage of both whole-genome and BAC
shotgun sequence data.

8.2 The low gene number in humans

We have sequenced and assembled ~95% of the euchromatic sequence of H.
sapiens and used a new automated gene prediction method to produce a
preliminary catalog of the human genes. This has provided a major
surprise: We have found far fewer genes (26,000 to 38,000) than the
earlier molecular predictions (50,000 to over 140,000). Whatever the
reasons for this current disparity, only detailed annotation, comparative
genomics (particularly using the Mus musculus genome), and careful
molecular dissection of complex phenotypes will clarify this critical
issue of the basic "parts list" of our genome. Certainly, the analysis is
still incomplete and considerable refinement will occur in the years to
come as the precise structure of each transcription unit is evaluated. A
good place to start is to determine why the gene estimates derived from
EST data are so discordant with our predictions. It is likely that the
following contribute to an inflated gene number derived from ESTs: the
variable lengths of 3'- and 5'-untranslated leaders and trailers; the
little-understood vagaries of RNA processing that often leave intronic
regions in an unspliced condition; the finding that nearly 40% of human
genes are alternatively spliced (153); and finally, the unsolved technical
problems in EST library construction where contamination from
heterogeneous nuclear RNA and genomic DNA are not uncommon. Of course, it
is possible that there are genes that remain unpredicted owing to the
absence of EST or protein data to support them, although our use of mouse
genome data for predicting genes should limit this number. As was true at
the beginning of genome sequencing, ultimately it will be necessary to
measure mRNA in specific cell types to [...] the presence of a gene.

J. B. S. Haldane speculated in 1937 that a population of organisms might
have to pay a [...] carry. He theorized that when the number of genes
becomes too large, each zygote carries so many new deleterious mutations
that the population simply cannot maintain itself. On the basis of this
premise, and on the basis of available mutation rates and x-ray-induced
mutations at specific loci, Muller, in 1967 [...] genome would contain a
maximum of not much more than 30,000 genes (155). An estimate of 30,000
gene loci for humans was also arrived at by Crow and Kimura [...]
estimate for D. melanogaster was 10,000 genes, compared to 13,000 derived
by annotation of the fly genome (25, 27). These arguments for the
theoretical maximum gene number were based on simplified ideas of genetic
load: that all genes have a certain low rate of mutation to a deleterious
state. However, it is clear that many mouse, fly, worm, and yeast knockout
mutations lead to almost no discernible phenotypic perturbations.

The modest number of human genes means that we must look elsewhere for the
mechanisms [...] inherent in human development and the sophisticated
signaling systems that maintain homeostasis. There are a large number of
ways in which the functions of individual genes and gene products are
regulated. The degree of "openness" of chromatin structure and hence
transcriptional activity is regulated by protein complexes that involve
histone and DNA enzymatic modifications. We enumerate many of the proteins
that are likely involved in nuclear regulation in Table 19. The location,
timing, and quantity of transcription are intimately linked to nuclear
signal transduction events as well as to the tissue-specific expression of
many of these proteins. Equally important are regulatory DNA elements that
include insulators, repeats, and endogenous viruses (157); methylation of
CpG islands in imprinting (158); and promoter-enhancer and intronic
regions that modulate transcription. The spliceosomal machinery consists
of multisubunit proteins (Table 19) as well as structural and catalytic
RNA elements (159) that regulate transcript structure through alternative
start and termination sites and splicing. Hence, there is a need to study
different classes of RNA molecules (160) such as small nucleolar RNAs,
antisense riboregulator RNA, RNA involved in X-dosage compensation, and
other structural RNAs to appreciate their precise role in regulating gene
expression. The phenomenon of RNA editing, in which coding changes occur
directly at the level of mRNA, is of clinical and biological relevance
(161). Finally, examples of translational control include internal
ribosomal entry sites that are found in proteins involved in cell cycle
regulation and apoptosis (162). At the protein level, minor alterations in
the nature of protein-protein interactions, protein modifications, and
localization can have dramatic effects on cellular physiology (163). This
dynamic system therefore has many ways to modulate activity, which
suggests that definition of complex systems by analysis of single genes is
unlikely to be entirely successful.
In situ studies have shown that the human genome is asymmetrically
populated with G+C content, CpG islands, and genes (68). However, the
genes are not distributed quite as unequally as had been predicted (Table
9) (69). The most G+C-rich fraction of the genome, H3 isochores,
constitute more of the genome than previously thought (about 9%), and are
the most gene-dense fraction, but contain only 25% of the genes, rather
than the predicted ~40%. The low G+C L isochores make up 65% of the
genome, and 48% of the genes. This inhomogeneity, the net result of
millions of years of mammalian gene duplication, has been described as the
"desertification" of the vertebrate genome (71). Why are there clustered
regions of high and low gene density, and are these accidents of history
or driven by selection and evolution? If these deserts are dispensable, it
ought to be possible to find mammalian genomes that are far smaller in
size than the human genome. Indeed, many species of bats have genome sizes
that are much smaller than that of humans; for example, Miniopterus, a
species of Italian bat, has a genome size that is only 50% that of humans
(164). Similarly, Muntiacus, a species of Asian barking deer, has a genome
size that is ~70% that of humans.
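The isochore figures quoted in this section can be turned into a relative gene density, that is, the share of genes divided by the share of the genome. A minimal sketch using only the percentages stated above (H3: about 9% of the genome and 25% of the genes; L: 65% of the genome and 48% of the genes); the function name `relative_gene_density` is an illustrative assumption, and a value of 1.0 would correspond to the genome-wide average:

```python
# Fractions quoted in the text, as (share of genome, share of genes).
ISOCHORES = {
    "H3 (G+C-rich)": (0.09, 0.25),
    "L (low G+C)": (0.65, 0.48),
}


def relative_gene_density(genome_share, gene_share):
    """Gene share divided by genome share; 1.0 equals genome-average density."""
    return gene_share / genome_share


for name, (genome_share, gene_share) in ISOCHORES.items():
    density = relative_gene_density(genome_share, gene_share)
    print(f"{name}: {density:.2f}x the genome-average gene density")
```

On these numbers the H3 isochores are roughly 2.8 times as gene-dense as the genome average and the L isochores roughly 0.74 times, which illustrates why H3 remains the most gene-dense fraction even though it holds only a quarter of the genes.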

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a nearly uniform
ascertainment of polymorphism has been completed. Although we have
identified and mapped more than 3 million SNPs, this by no means implies
that the task of finding and cataloging SNPs is complete. These represent
only a fraction of the SNPs present in the human population as a whole.
Nevertheless, this first glimpse at genome-wide variation has revealed
strong inhomogeneities in the distribution of SNPs across the genome.
Polymorphism in DNA carries with it a snapshot of the past operation of
population genetic forces, including mutation, migration, selection, and
genetic drift. The availability of a dense array of SNPs will allow
questions related to each of these factors to be addressed on a
genome-wide basis. SNP studies can establish the range of haplotypes
present in subjects of different ethnogeographic origins, providing
insights into population history and migration patterns. Although such
studies have suggested that modern human lineages derive from Africa, many
important questions regarding human origins remain unanswered, and more
analyses using detailed SNP maps will be needed to settle these
controversies. In addition to providing evidence for population
expansions, migration, and admixture, SNPs can serve as markers for the
extent of evolutionary constraint acting on particular genes. The
correlation between patterns of intraspecies and interspecies genetic
variation may prove to be especially informative to identify sites of
reduced genetic diversity that may mark loci where sequence variations are
not [...]

The remarkable heterogeneity in SNP
density implies that there are a variety of
forces acting on polymorphism: sparse regions
may have lower SNP density because the
mutation rate is lower, because most of those
regions have a lower fraction of mutations that
are tolerated, or because recent strong
selection in favor of a newly arisen allele
"swept" the linked variation out of the
population (165). The effect of random genetic
drift also varies widely across the genome. The
nonrecombining portion of the Y chromosome
faces the strongest pressure from random drift
because there are roughly one-quarter as many
Y chromosomes in the population as there are
autosomal chromosomes, and the level of
polymorphism on the Y is correspondingly less.
Similarly, the X chromosome has a smaller
effective population size than the autosomes,
and its nucleotide diversity is also reduced.
But even across a single autosome, the
effective population size can vary because the
density of deleterious mutations may vary.
Regions of high density of deleterious
mutations will see a greater rate of
elimination by selection, and the effective
population size will be smaller (166). As a
result, the density of even completely neutral
SNPs will be lower in such regions. There is a
large literature on the association between SNP
density and local recombination rates in
Drosophila, and it remains an important task to
assess the strength of this association in the
human genome, because of its impact on the
design of local SNP densities for
disease-association studies. It also remains an
important task to validate SNPs on a genomic
scale in order to assess the degree of
heterogeneity among geographic and ethnic
populations.
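
The "one-quarter as many Y chromosomes" argument can be checked by counting chromosome copies. The Python sketch below is illustrative only, and all function names are assumptions. It counts copies directly (two of each autosome per person, two X per female and one per male, one Y per male) and then scales a neutral diversity value by the copy-number ratio. That scaling assumes an equal mutation rate on every chromosome and a simple neutral model in which diversity is proportional to effective population size; the diversity value in the usage lines is a hypothetical placeholder, not a measurement from this study.

```python
from fractions import Fraction


def chromosome_copies(n_males, n_females):
    """Count copies of an autosome, the X, and the Y in a population."""
    return {
        "autosome": 2 * (n_males + n_females),
        "X": n_males + 2 * n_females,
        "Y": n_males,
    }


def relative_copy_number(n_males, n_females):
    """Express each copy count as a fraction of the autosomal copy count."""
    copies = chromosome_copies(n_males, n_females)
    if copies["autosome"] == 0:
        raise ValueError("population must contain at least one individual")
    return {k: Fraction(v, copies["autosome"]) for k, v in copies.items()}


def scaled_diversity(autosomal_diversity, n_males, n_females):
    """Scale a neutral autosomal diversity by relative copy number.

    Assumes diversity is proportional to copy number (equal mutation
    rates, no selection), which is a simplification.
    """
    rel = relative_copy_number(n_males, n_females)
    return {k: autosomal_diversity * float(r) for k, r in rel.items()}


if __name__ == "__main__":
    print(relative_copy_number(500, 500))
    # 8e-4 is a hypothetical autosomal diversity used only for illustration.
    print(scaled_diversity(8e-4, 500, 500))
```

With equal numbers of males and females the ratios come out as 3/4 for the X and 1/4 for the Y, matching the ordering described in the text. The further reduction caused by selection against linked deleterious mutations is not modeled here.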
8.4 Genome complexity

We will soon be in a position to move away
from the cataloging of individual components
of the system, and beyond the simplistic
notions of "this binds to that, which then
docks on this, and then the complex moves
there. . . ." (167) to the exciting area of
network perturbations, nonlinear responses and
thresholds, and their pivotal role in human
diseases.

The enumeration of other "parts lists" reveals
that in organisms with complex nervous systems,
neither gene number, neuron number, nor number
of cell types correlates in any meaningful
manner with even simplistic measures of
structural or behavioral complexity. Nor would
they be expected to; this is the realm of
nonlinearities and epigenesis (168). The 520
million neurons of the common octopus exceed
the neuronal number in the brain of a mouse by
an order of magnitude. It is apparent from a
comparison of genomic data on the mouse and
human, and from comparative mammalian
neuroanatomy (169), that the morphological and
behavioral diversity found in mammals is
underpinned by a similar gene repertoire and
similar neuroanatomies. For example, when one
compares a pygmy marmoset (which is only 4
inches tall and weighs about 6 ounces) to a
chimpanzee, the brain volume of this minute
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp
and three orders less than that of humans. Yet
the neuroanatomies of all three brains are
strikingly similar, and the behavioral
characteristics of the pygmy marmoset are
little different from those of chimpanzees.
Between humans and chimpanzees, the gene
number, gene structures and functions,
chromosomal and genomic organizations, and cell
types and neuroanatomies are almost
indistinguishable, yet the developmental
modifications that predisposed human lineages
to cortical expansion and development of the
larynx, giving rise to language, culminated in
a massive singularity that by even the simplest
of criteria made humans more complex in a
behavioral sense.

Simple examination of the number of neurons,
cell types, or genes or of the genome size does
not alone account for the differences in
complexity that we observe. Rather, it is the
interactions within and among these sets that
result in such great variation. In addition, it
is possible that there are "special cases" of
regulatory gene networks that have a
disproportionate effect on the overall system.
We have presented several examples of
"regulatory genes" that are significantly
increased in the human genome compared with the
fly and worm. These include extracellular
ligands and their cognate receptors (e.g., wnt,
frizzled, TGF-β, ephrins, and connexins), as
well as nuclear regulators (e.g., the KRAB and
homeodomain transcription factor families),
where a few proteins control broad
developmental processes. The answers to these
"complexities" perhaps lie in these expanded
gene families and differences in the regulatory
control of ancient genes, proteins, pathways,
and cells.

8.5 Beyond single components

While few would disagree with the intuitive
conclusion that Einstein's brain was more
complex than that of Drosophila, closer
comparisons such as whether the set of predicted
human proteins is more complex than the
protein set of Drosophila, and if so, to what
degree, are not straightforward, since protein,
protein domain, or protein-protein interaction
measures do not capture context-dependent
interactions that underpin the dynamics
underlying phenotype.

Currently, there are more than 30 different
mathematical descriptions of complexity (170).
However, we have yet to understand the
mathematical dependency relating the number of
genes with organism complexity. One pragmatic
approach to the analysis of biological systems,
which are composed of nonidentical elements
(proteins, protein complexes, interacting cell
types, and interacting neuronal populations),
is through graph theory (171). The elements of
the system can be represented by the vertices
of complex topographies, with the edges
representing the interactions between them.
Examination of large networks reveals that they
can self-organize, but more important, they can
be particularly robust. This robustness is not
due to redundancy, but is a property of
inhomogeneously wired networks. The error
tolerance of such networks comes with a price;
they are vulnerable to the selection or removal
of a few nodes that contribute
disproportionately to network stability. Gene
knockouts provide an illustration. Some
knockouts may have minor effects, whereas
others have catastrophic effects on the system.
In the case of vimentin, a supposedly critical
component of the cytoplasmic intermediate
filament network of mammals, the knockout of
the gene in mice reveals them to be
reproductively normal, with no obvious
phenotypic effects (172), and yet the usually
conspicuous vimentin network is completely
absent. On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical
nodes whose reduction in gene product, or total
elimination, causes the network to crash most
of the time, although even in some of these
cases, phenotypic normalcy ensues, given the
appropriate genetic background. Thus, there are
no "good" genes or "bad" genes, but only
networks that exist at various levels and at
different connectivities, and at different
states of sensitivity to perturbation.
Sophisticated mathematical analysis needs to be
constantly evaluated against hard biological
data sets that specifically address network
dynamics. Nowhere is this more critical than in
attempts to come to grips with "complexity,"
particularly because deconvoluting and
correcting complex networks that have undergone
perturbation, and have resulted in human
diseases, is the greatest significant challenge
now facing us.
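
The contrast between tolerated and catastrophic node removals can be illustrated on a toy graph. The Python sketch below is a minimal illustration under stated assumptions, not the analysis of reference (171): the hub-and-leaf wiring, the node names, and both function names are invented for the example, and no biological data are involved. Vertices stand for system elements, edges for interactions, and the size of the largest connected component is used as a crude proxy for whether the network still holds together.

```python
from collections import deque


def build_hub_network(n_hubs=3, leaves_per_hub=5):
    """Build a toy inhomogeneous network: a chain of hubs, each with leaves."""
    adj = {}

    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for h in range(n_hubs):
        hub = f"hub{h}"
        if h > 0:
            add_edge(f"hub{h - 1}", hub)  # hubs are linked in a chain
        for leaf in range(leaves_per_hub):
            add_edge(hub, f"leaf{h}_{leaf}")  # each leaf touches one hub only
    return adj


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


if __name__ == "__main__":
    network = build_hub_network()
    print("intact:", largest_component_size(network))
    print("leaf removed:", largest_component_size(network, {"leaf0_0"}))
    print("hub removed:", largest_component_size(network, {"hub1"}))
```

In this toy wiring, deleting a leaf shrinks the network by a single node, whereas deleting the central hub splits it into fragments. A real interaction network would need measured edges and a more careful measure of function than component size.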

It has been predicted for the last 15 years
that complete sequencing of the human genome
would open up new strategies for human
biological research and would have a major
impact on medicine, and through medicine and
public health, on society. Effects on
biomedical research are already being felt.
This assembly of the human genome sequence is
but a first, hesitant step on a long and
exciting journey toward understanding the role
of the genome in human biology. It has been
possible only because of innovations in
instrumentation and software that have allowed
automation of almost every step of the process
from DNA preparation to annotation. The next
steps are clear: We must define the complexity
that ensues when this relatively modest set of
about 30,000 genes is expressed. The sequence
provides the framework upon which all the
genetics, biochemistry, physiology, and
ultimately phenotype depend. It provides the
boundaries for scientific inquiry. The sequence
is only the first level of understanding of the
genome. All genes and their control elements
must be identified; their functions, in concert
as well as in isolation, defined; their
sequence variation worldwide described; and the
relation between genome variation and specific
phenotypic characteristics determined. Now we
know what we have to explain.

Another paramount challenge awaits: 
public discussion of this information and its
potential for improvement of personal health.
Many diverse sources of data have shown
that any two individuals are more than 99.9%
identical in sequence, which means that all
the glorious differences among individuals in
our species that can be attributed to genes
fall in a mere 0.1% of the sequence. There
are two fallacies to be avoided: determinism,
the idea that all characteristics of the person
are "hard-wired" by the genome; and
reductionism, the view that with complete
knowledge of the human genome sequence, it is
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence.

into Bst XI-linearized plasmid vector with
3'-TGTG overhangs. Libraries with three
different average sizes of inserts were
constructed: 2, 10, and 50 kbp. The 2-kbp
fragments were cloned in a high-copy pUC18
derivative. The 10- and 50-kbp fragments were
cloned in a medium-copy pBR322 derivative. The
2- and 10-kbp libraries yielded uniform-sized
large colonies on plating. However, the 50-kbp
libraries produced many small colonies and
inserts were unstable. To remedy this, the
50-kbp libraries were digested with Bgl II,
which does not cleave the vector, but generally
cleaved several times within the 50-kbp insert.
A 1264-bp Bam HI kanamycin resistance cassette
(purified from pUC4K; Amersham Pharmacia,
catalog no. 27-4958-01) was added and ligation
was carried out at 37°C in the continual
presence of Bgl II. As Bgl II-Bgl II ligations
occurred, they were continually cleaved,
whereas Bam HI-Bgl II ligations were not
cleaved. A yield of internally deleted circular
library molecules was obtained in which the
residual insert ends were separated by the
kanamycin cassette DNA. The internally deleted
libraries, when plated on agar containing
ampicillin (50 µg/ml), carbenicillin
(50 µg/ml), and kanamycin (15 µg/ml), produced
relatively uniform large colonies. The
resulting clones could be prepared for
sequencing using the same procedures as clones
from the 10-kbp
libraries.
34. Transformed cells were plated on agar diffusion
plates prepared with a fresh top layer containing no
antibiotic poured on top of a previously set bottom
layer containing excess antibiotic, to achieve the
correct final concentration. This method of plating
permitted the cells to develop antibiotic resistance
before being exposed to antibiotic without the
potential clone bias that can be introduced through
liquid outgrowth protocols. After colonies had
grown, QBot (Genetix, UK) automated colony-picking
robots were used to pick colonies meeting stringent
size and shape criteria and to inoculate 384-well
microtiter plates containing liquid growth medium.
Liquid cultures were incubated overnight, with
shaking, and were scored for growth before
passing to template preparation. Template DNA was
extracted from liquid bacterial culture using a
procedure based upon the alkaline lysis miniprep
method (173) adapted for high-throughput processing
in 384-well microtiter plates. Bacterial cells were
lysed; cell debris was removed by centrifugation;
and plasmid DNA was recovered by isopropanol
precipitation and resuspended in 10 mM tris-HCl
buffer. Reagent dispensing operations were
accomplished using Titertek MAP 8 liquid dispensing
systems. Plate-to-plate liquid transfers were
performed using Tomtec Quadra 384 Model 320
pipetting robots. All plates were tracked throughout
processing by unique plate barcodes. Mated
sequencing reads from opposite ends of each clone
insert were obtained by preparing two 384-well cycle
sequencing reaction plates from each plate of
plasmid template DNA using ABI PRISM BigDye
Terminator chemistry (Applied Biosystems) and
standard M13 forward and reverse primers. Sequencing
reactions were prepared using the Tomtec Quadra
384-320 pipetting robot. Parent-child plate
relationships and, by extension, forward-reverse
sequence mate pairs were established by automated
plate barcode reading by the onboard barcode reader
and were recorded by direct LIMS communication.
Sequencing reaction products were purified by
alcohol precipitation and were dried, sealed, and
stored at 4°C in the dark until needed for
sequencing, at which time the reaction products were
resuspended in deionized formamide and sealed
immediately to prevent degradation. All sequence
data were generated using a single sequencing
platform, the ABI PRISM 3700 DNA Analyzer. Sample
sheets were created at load time using a Java-based
application that facilitates barcode scanning of the
sequencing plate barcode, retrieves sample
information from the central LIMS, and reserves
unique trace identifiers. The application permitted
a single sample sheet file in the linking directory
and deleted previously created
sample sheet files immediately upon scanning of a

share at least one significant BLAST hit in common.
This is an especially interesting property of the
complete clusters, because it is impossible to place
a unique multidomain protein into a complete
cluster. Thus, the single-linkage and complete
clusters, plus singletons, should comprise a lower
and upper bound of sizes of core protein sets,
respectively, allowing us to compare the relative
size and com-